sparse x sparse matrix multiplication

2014-11-04 Thread ll
what is the best way to implement a sparse x sparse matrix multiplication
with spark?
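
One common way to do this, sketched below as a hedged example rather than an existing library call: store each matrix as an RDD of (row, col, value) entries and express the product as a join on the shared dimension followed by a reduceByKey. The names here are illustrative.

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// A is m x n, B is n x p, both stored as coordinate-list entries (row, col, value).
def multiply(a: RDD[(Int, Int, Double)],
             b: RDD[(Int, Int, Double)]): RDD[((Int, Int), Double)] = {
  // Key A by its column index and B by its row index so matching entries meet in the join.
  val aByCol = a.map { case (i, k, v) => (k, (i, v)) }
  val bByRow = b.map { case (k, j, w) => (k, (j, w)) }
  aByCol.join(bByRow)                                      // (k, ((i, v), (j, w)))
    .map { case (_, ((i, v), (j, w))) => ((i, j), v * w) } // partial products
    .reduceByKey(_ + _)                                    // C(i, j) = sum over k
}

Only (i, j) cells that receive at least one partial product appear in the output, which is what you want for a sparse result.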



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/sparse-x-sparse-matrix-multiplication-tp18163.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Best practice for join

2014-11-04 Thread Akhil Das
Oh, in that case, if you want to reduce the GC time, you can specify the
level of parallelism along with your join and reduceByKey operations.
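
A minimal sketch of what that looks like, assuming the pair RDDs ja and jab from the code quoted below and a target of 200 partitions (tune to roughly 2-3x your total cores); buildKV stands in for whatever produces the (K, V) pairs:

val numPartitions = 200                              // ~2-3x total executor cores
val jab = ja.join(jb, numPartitions)                 // shuffle into 200 partitions
val result = jab
  .map { case (k, (va, vb)) => buildKV(k, va, vb) }  // buildKV is a placeholder
  .reduceByKey(_ + _, numPartitions)                 // same parallelism on the reduce side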

Thanks
Best Regards

On Wed, Nov 5, 2014 at 1:11 PM, Benyi Wang  wrote:

> I'm using spark-1.0.0 in CDH 5.1.0. The big problem is SparkSQL doesn't
> support Hash join in this version.
>
> On Tue, Nov 4, 2014 at 10:54 PM, Akhil Das 
> wrote:
>
>> How about Using SparkSQL ?
>>
>> Thanks
>> Best Regards
>>
>> On Wed, Nov 5, 2014 at 1:53 AM, Benyi Wang  wrote:
>>
>>> I need to join RDD[A], RDD[B], and RDD[C]. Here is what I did,
>>>
>>> # build (K,V) from A and B to prepare the join
>>>
>>> val ja = A.map( r => (K1, Va))
>>> val jb = B.map( r => (K1, Vb))
>>>
>>> # join A, B
>>>
>>> val jab = ja.join(jb)
>>>
>>> # build (K,V) from the joined result of A and B to prepare joining with C
>>>
>>> val jc = C.map(r => (K2, Vc))
>>> jab.join(jc).map( r => (K,V) ).reduceByKey(_ + _)
>>>
>>> Because A may have multiple fields, Va is a tuple with more than two
>>> fields. It is said that Scala tuples may not be specialized and there is a
>>> boxing/unboxing issue, so I tried to use case classes for Va, Vb, and Vc,
>>> and for K2 and K, which are compound keys. V is a pair of count and ratio;
>>> _+_ creates a new ratio. I register those case classes in Kryo.
>>>
>>> The sizes of Shuffle read/write look smaller. But I found GC overhead is
>>> really high: GC Time is about 20~30% of duration for the reduceByKey task.
>>> I think a lot of new objects are created using case classes during
>>> map/reduce.
>>>
>>> How to make the thing better?
>>>
>>
>>
>


Re: spark_ec2.py for AWS region: cn-north-1, China

2014-11-04 Thread haitao .yao
Done, JIRA link: https://issues.apache.org/jira/browse/SPARK-4241

Thanks.

2014-11-05 10:58 GMT+08:00 Nicholas Chammas :

> Oh, I can see that region via boto as well. Perhaps the doc is indeed out
> of date.
>
> Do you mind opening a JIRA issue
>  to track this
> request? I can do it if you've never opened a JIRA issue before.
>
> Nick
>
> On Tue, Nov 4, 2014 at 9:03 PM, haitao .yao  wrote:
>
>> I'm afraid not. We have been using EC2 instances in cn-north-1 region for
>> a while. And the latest version of boto has added the region: cn-north-1
>> Here's the  screenshot:
>>  from  boto import ec2
>> >>> ec2.regions()
>> [RegionInfo:us-east-1, RegionInfo:cn-north-1, RegionInfo:ap-northeast-1,
>> RegionInfo:eu-west-1, RegionInfo:ap-southeast-1, RegionInfo:ap-southeast-2,
>> RegionInfo:us-west-2, RegionInfo:us-gov-west-1, RegionInfo:us-west-1,
>> RegionInfo:eu-central-1, RegionInfo:sa-east-1]
>> >>>
>>
>> I do think the doc is out of date.
>>
>>
>>
>> 2014-11-05 9:45 GMT+08:00 Nicholas Chammas :
>>
>>>
>>> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
>>>
>>> cn-north-1 is not a supported region for EC2, as far as I can tell.
>>> There may be other AWS services that can use that region, but spark-ec2
>>> relies on EC2.
>>>
>>> Nick
>>>
>>> On Tue, Nov 4, 2014 at 8:09 PM, haitao .yao  wrote:
>>>
 Hi,
Amazon aws started to provide service for China mainland, the region
 name is cn-north-1. But the script spark provides: spark_ec2.py will query
 ami id from https://github.com/mesos/spark-ec2/tree/v4/ami-list and
 there's no ami information for cn-north-1 region .
   Can anybody update the ami information and update the repo:
 https://github.com/mesos/spark-ec2.git ?

Thanks.

 --
 haitao.yao




>>>
>>
>>
>> --
>> haitao.yao
>>
>>
>>
>>
>


-- 
haitao.yao


Re: Best practice for join

2014-11-04 Thread Benyi Wang
I'm using spark-1.0.0 in CDH 5.1.0. The big problem is SparkSQL doesn't
support Hash join in this version.

On Tue, Nov 4, 2014 at 10:54 PM, Akhil Das 
wrote:

> How about Using SparkSQL ?
>
> Thanks
> Best Regards
>
> On Wed, Nov 5, 2014 at 1:53 AM, Benyi Wang  wrote:
>
>> I need to join RDD[A], RDD[B], and RDD[C]. Here is what I did,
>>
>> # build (K,V) from A and B to prepare the join
>>
>> val ja = A.map( r => (K1, Va))
>> val jb = B.map( r => (K1, Vb))
>>
>> # join A, B
>>
>> val jab = ja.join(jb)
>>
>> # build (K,V) from the joined result of A and B to prepare joining with C
>>
>> val jc = C.map(r => (K2, Vc))
>> jab.join(jc).map( r => (K,V) ).reduceByKey(_ + _)
>>
>> Because A may have multiple fields, Va is a tuple with more than two
>> fields. It is said that Scala tuples may not be specialized and there is a
>> boxing/unboxing issue, so I tried to use case classes for Va, Vb, and Vc,
>> and for K2 and K, which are compound keys. V is a pair of count and ratio;
>> _+_ creates a new ratio. I register those case classes in Kryo.
>>
>> The sizes of Shuffle read/write look smaller. But I found GC overhead is
>> really high: GC Time is about 20~30% of duration for the reduceByKey task.
>> I think a lot of new objects are created using case classes during
>> map/reduce.
>>
>> How to make the thing better?
>>
>
>


Re: Kafka Consumer in Spark Streaming

2014-11-04 Thread Akhil Das
Your code doesn't trigger any action. How about the following?

JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new
Duration(60 * 1 * 1000));

JavaPairReceiverInputDStream tweets = KafkaUtils.createStream(ssc,
":2181", "1", map);

JavaDStream statuses = tweets.map(
new Function() {
public String call(String status) {
System.out.println(status);
return status;
}
}
);


statuses.print();

Or you could use foreachRDD instead of map() if your intention is just
printing.

Thanks
Best Regards

On Wed, Nov 5, 2014 at 12:35 PM, Something Something <
mailinglist...@gmail.com> wrote:

> It's not local.  My spark url is something like this:
>
> String sparkUrl = "spark://:7077";
>
>
> On Tue, Nov 4, 2014 at 11:03 PM, Jain Rahul  wrote:
>
>>
>>  I think you are running it locally.
>> Do you have local[1] here for master url? If yes change it to local[2] or
>> more number of threads.
>> It may be due to topic name mismatch also.
>>
>>  sparkConf.setMaster(“local[1]");
>>
>>  Regards,
>> Rahul
>>
>>   From: Something Something 
>> Date: Wednesday, November 5, 2014 at 12:23 PM
>> To: "Shao, Saisai" 
>> Cc: "user@spark.apache.org" 
>>
>> Subject: Re: Kafka Consumer in Spark Streaming
>>
>>   Added foreach as follows.  Still don't see any output on my console.
>> Would this go to the worker logs as Jerry indicated?
>>
>> JavaPairReceiverInputDStream tweets =
>> KafkaUtils.createStream(ssc, ":2181", "1", map);
>> JavaDStream statuses = tweets.map(
>> new Function() {
>> public String call(String status) {
>> return status;
>> }
>> }
>> );
>>
>> statuses.foreach(new Function, Void>() {
>> @Override
>> public Void call(JavaRDD stringJavaRDD) throws
>> Exception {
>> for (String str: stringJavaRDD.take(10)) {
>> System.out.println("Message: " + str);
>> }
>> return null;
>> }
>> });
>>
>>
>> On Tue, Nov 4, 2014 at 10:32 PM, Shao, Saisai 
>> wrote:
>>
>>>  If you’re running on a standalone mode, the log is under
>>> /work/ directory. I’m not sure for yarn or mesos, you can check
>>> the document of Spark to see the details.
>>>
>>>
>>>
>>> Thanks
>>>
>>> Jerry
>>>
>>>
>>>
>>> *From:* Something Something [mailto:mailinglist...@gmail.com]
>>> *Sent:* Wednesday, November 05, 2014 2:28 PM
>>> *To:* Shao, Saisai
>>> *Cc:* user@spark.apache.org
>>> *Subject:* Re: Kafka Consumer in Spark Streaming
>>>
>>>
>>>
>>> The Kafka broker definitely has messages coming in.  But your #2 point
>>> is valid.  Needless to say I am a newbie to Spark.  I can't figure out
>>> where the 'executor' logs would be.  How would I find them?
>>>
>>> All I see printed on my screen is this:
>>>
>>> 14/11/04 22:21:23 INFO Slf4jLogger: Slf4jLogger started
>>> 14/11/04 22:21:23 INFO Remoting: Starting remoting
>>> 14/11/04 22:21:24 INFO Remoting: Remoting started; listening on
>>> addresses :[akka.tcp://spark@mymachie:60743]
>>> 14/11/04 22:21:24 INFO Remoting: Remoting now listens on addresses:
>>> [akka.tcp://spark@mymachine:60743]
>>> 14/11/04 22:21:24 WARN NativeCodeLoader: Unable to load native-hadoop
>>> library for your platform... using builtin-java classes where applicable
>>> 14/11/04 22:21:24 INFO JniBasedUnixGroupsMappingWithFallback: Falling
>>> back to shell based
>>> ---
>>> Time: 141516852 ms
>>> ---
>>> ---
>>> Time: 141516852 ms
>>> ---
>>>
>>> Keeps repeating this...
>>>
>>>
>>>
>>> On Tue, Nov 4, 2014 at 10:14 PM, Shao, Saisai 
>>> wrote:
>>>
>>>  Hi, would you mind describing your problem a little more specific.
>>>
>>>
>>>
>>> 1.  Is the Kafka broker currently has no data feed in?
>>>
>>> 2.  This code will print the lines, but not in the driver side, the
>>> code is running in the executor side, so you can check the log in worker
>>> dir to see if there’s any printing logs under this folder.
>>>
>>> 3.  Did you see any exceptions when running the app, this will help
>>> to define the problem.
>>>
>>>
>>>
>>> Thanks
>>>
>>> Jerry
>>>
>>>
>>>
>>> *From:* Something Something [mailto:mailinglist...@gmail.com]
>>> *Sent:* Wednesday, November 05, 2014 1:57 PM
>>> *To:* user@spark.apache.org
>>> *Subject:* Kafka Consumer in Spark Streaming
>>>
>>>
>>>
>>> I've following code in my program.  I don't get any error, but it's not
>>> consuming the messages either.  Shouldn't the following code print the line
>>> in the 'call' method?  What am I missing?
>>>
>>> Please help.  Thanks.
>>>
>>>
>>>
>>> JavaStreamingContext ssc = new JavaStreamingContext(

Re: Kafka Consumer in Spark Streaming

2014-11-04 Thread Something Something
It's not local.  My spark url is something like this:

String sparkUrl = "spark://:7077";


On Tue, Nov 4, 2014 at 11:03 PM, Jain Rahul  wrote:

>
>  I think you are running it locally.
> Do you have local[1] here for master url? If yes change it to local[2] or
> more number of threads.
> It may be due to topic name mismatch also.
>
>  sparkConf.setMaster(“local[1]");
>
>  Regards,
> Rahul
>
>   From: Something Something 
> Date: Wednesday, November 5, 2014 at 12:23 PM
> To: "Shao, Saisai" 
> Cc: "user@spark.apache.org" 
>
> Subject: Re: Kafka Consumer in Spark Streaming
>
>   Added foreach as follows.  Still don't see any output on my console.
> Would this go to the worker logs as Jerry indicated?
>
> JavaPairReceiverInputDStream tweets = KafkaUtils.createStream(ssc,
> ":2181", "1", map);
> JavaDStream statuses = tweets.map(
> new Function() {
> public String call(String status) {
> return status;
> }
> }
> );
>
> statuses.foreach(new Function, Void>() {
> @Override
> public Void call(JavaRDD stringJavaRDD) throws
> Exception {
> for (String str: stringJavaRDD.take(10)) {
> System.out.println("Message: " + str);
> }
> return null;
> }
> });
>
>
> On Tue, Nov 4, 2014 at 10:32 PM, Shao, Saisai 
> wrote:
>
>>  If you’re running on a standalone mode, the log is under
>> /work/ directory. I’m not sure for yarn or mesos, you can check
>> the document of Spark to see the details.
>>
>>
>>
>> Thanks
>>
>> Jerry
>>
>>
>>
>> *From:* Something Something [mailto:mailinglist...@gmail.com]
>> *Sent:* Wednesday, November 05, 2014 2:28 PM
>> *To:* Shao, Saisai
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Kafka Consumer in Spark Streaming
>>
>>
>>
>> The Kafka broker definitely has messages coming in.  But your #2 point is
>> valid.  Needless to say I am a newbie to Spark.  I can't figure out where
>> the 'executor' logs would be.  How would I find them?
>>
>> All I see printed on my screen is this:
>>
>> 14/11/04 22:21:23 INFO Slf4jLogger: Slf4jLogger started
>> 14/11/04 22:21:23 INFO Remoting: Starting remoting
>> 14/11/04 22:21:24 INFO Remoting: Remoting started; listening on addresses
>> :[akka.tcp://spark@mymachie:60743]
>> 14/11/04 22:21:24 INFO Remoting: Remoting now listens on addresses:
>> [akka.tcp://spark@mymachine:60743]
>> 14/11/04 22:21:24 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> 14/11/04 22:21:24 INFO JniBasedUnixGroupsMappingWithFallback: Falling
>> back to shell based
>> ---
>> Time: 141516852 ms
>> ---
>> ---
>> Time: 141516852 ms
>> ---
>>
>> Keeps repeating this...
>>
>>
>>
>> On Tue, Nov 4, 2014 at 10:14 PM, Shao, Saisai 
>> wrote:
>>
>>  Hi, would you mind describing your problem a little more specific.
>>
>>
>>
>> 1.  Is the Kafka broker currently has no data feed in?
>>
>> 2.  This code will print the lines, but not in the driver side, the
>> code is running in the executor side, so you can check the log in worker
>> dir to see if there’s any printing logs under this folder.
>>
>> 3.  Did you see any exceptions when running the app, this will help
>> to define the problem.
>>
>>
>>
>> Thanks
>>
>> Jerry
>>
>>
>>
>> *From:* Something Something [mailto:mailinglist...@gmail.com]
>> *Sent:* Wednesday, November 05, 2014 1:57 PM
>> *To:* user@spark.apache.org
>> *Subject:* Kafka Consumer in Spark Streaming
>>
>>
>>
>> I've following code in my program.  I don't get any error, but it's not
>> consuming the messages either.  Shouldn't the following code print the line
>> in the 'call' method?  What am I missing?
>>
>> Please help.  Thanks.
>>
>>
>>
>> JavaStreamingContext ssc = new JavaStreamingContext(sparkConf,
>> new Duration(60 * 1 * 1000));
>>
>> JavaPairReceiverInputDStream tweets =
>> KafkaUtils.createStream(ssc, ":2181", "1", map);
>>
>> JavaDStream statuses = tweets.map(
>> new Function() {
>> public String call(String status) {
>> System.out.println(status);
>> return status;
>> }
>> }
>> );
>>
>>
>>
>

Re: Kafka Consumer in Spark Streaming

2014-11-04 Thread Jain Rahul

I think you are running it locally.
Do you have local[1] here for the master URL? If yes, change it to local[2] or a
higher number of threads.
It may also be due to a topic name mismatch.

sparkConf.setMaster(“local[1]");

Regards,
Rahul

From: Something Something 
mailto:mailinglist...@gmail.com>>
Date: Wednesday, November 5, 2014 at 12:23 PM
To: "Shao, Saisai" mailto:saisai.s...@intel.com>>
Cc: "user@spark.apache.org" 
mailto:user@spark.apache.org>>
Subject: Re: Kafka Consumer in Spark Streaming

Added foreach as follows.  Still don't see any output on my console.  Would 
this go to the worker logs as Jerry indicated?

JavaPairReceiverInputDStream tweets = KafkaUtils.createStream(ssc, 
":2181", "1", map);
JavaDStream statuses = tweets.map(
new Function() {
public String call(String status) {
return status;
}
}
);

statuses.foreach(new Function, Void>() {
@Override
public Void call(JavaRDD stringJavaRDD) throws Exception {
for (String str: stringJavaRDD.take(10)) {
System.out.println("Message: " + str);
}
return null;
}
});


On Tue, Nov 4, 2014 at 10:32 PM, Shao, Saisai 
mailto:saisai.s...@intel.com>> wrote:
If you’re running on a standalone mode, the log is under /work/ 
directory. I’m not sure for yarn or mesos, you can check the document of Spark 
to see the details.

Thanks
Jerry

From: Something Something 
[mailto:mailinglist...@gmail.com]
Sent: Wednesday, November 05, 2014 2:28 PM
To: Shao, Saisai
Cc: user@spark.apache.org
Subject: Re: Kafka Consumer in Spark Streaming

The Kafka broker definitely has messages coming in.  But your #2 point is 
valid.  Needless to say I am a newbie to Spark.  I can't figure out where the 
'executor' logs would be.  How would I find them?
All I see printed on my screen is this:

14/11/04 22:21:23 INFO Slf4jLogger: Slf4jLogger started
14/11/04 22:21:23 INFO Remoting: Starting remoting
14/11/04 22:21:24 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://spark@mymachie:60743]
14/11/04 22:21:24 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://spark@mymachine:60743]
14/11/04 22:21:24 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
14/11/04 22:21:24 INFO JniBasedUnixGroupsMappingWithFallback: Falling back to 
shell based
---
Time: 141516852 ms
---
---
Time: 141516852 ms
---
Keeps repeating this...

On Tue, Nov 4, 2014 at 10:14 PM, Shao, Saisai 
mailto:saisai.s...@intel.com>> wrote:
Hi, would you mind describing your problem a little more specific.


1.  Is the Kafka broker currently has no data feed in?

2.  This code will print the lines, but not in the driver side, the code is 
running in the executor side, so you can check the log in worker dir to see if 
there’s any printing logs under this folder.

3.  Did you see any exceptions when running the app, this will help to 
define the problem.

Thanks
Jerry

From: Something Something 
[mailto:mailinglist...@gmail.com]
Sent: Wednesday, November 05, 2014 1:57 PM
To: user@spark.apache.org
Subject: Kafka Consumer in Spark Streaming

I've following code in my program.  I don't get any error, but it's not 
consuming the messages either.  Shouldn't the following code print the line in 
the 'call' method?  What am I missing?

Please help.  Thanks.



JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new 
Duration(60 * 1 * 1000));

JavaPairReceiverInputDStream tweets = KafkaUtils.createStream(ssc, 
":2181", "1", map);

JavaDStream statuses = tweets.map(
new Function() {
public String call(String status) {
System.out.println(status);
return status;
}
}
);


This email and any attachments are confidential, and may be legally privileged 
and protected by copyright. If you are not the intended recipient dissemination 
or copying of this email is prohibited. If you have received this in error, 
please notify the sender by replying by email and then delete the email 
completely from your system. Any views or opinions are solely those of the 
sender. This communication is not intended to form a binding contract unless 
expressly indicated to the contrary and properly authorised. Any actions taken 
on the basis of this email are at the recipient's own risk.


Re: Best practice for join

2014-11-04 Thread Akhil Das
How about Using SparkSQL ?

Thanks
Best Regards

On Wed, Nov 5, 2014 at 1:53 AM, Benyi Wang  wrote:

> I need to join RDD[A], RDD[B], and RDD[C]. Here is what I did,
>
> # build (K,V) from A and B to prepare the join
>
> val ja = A.map( r => (K1, Va))
> val jb = B.map( r => (K1, Vb))
>
> # join A, B
>
> val jab = ja.join(jb)
>
> # build (K,V) from the joined result of A and B to prepare joining with C
>
> val jc = C.map(r => (K2, Vc))
> jab.join(jc).map( r => (K,V) ).reduceByKey(_ + _)
>
> Because A may have multiple fields, Va is a tuple with more than two
> fields. It is said that Scala tuples may not be specialized and there is a
> boxing/unboxing issue, so I tried to use case classes for Va, Vb, and Vc,
> and for K2 and K, which are compound keys. V is a pair of count and ratio;
> _+_ creates a new ratio. I register those case classes in Kryo.
>
> The sizes of Shuffle read/write look smaller. But I found GC overhead is
> really high: GC Time is about 20~30% of duration for the reduceByKey task.
> I think a lot of new objects are created using case classes during
> map/reduce.
>
> How to make the thing better?
>


Re: GraphX and Spark

2014-11-04 Thread Kamal Banga
GraphX is built on *top* of Spark, so Spark can achieve whatever GraphX can.
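
For example, here is a hedged sketch of the basic pattern: an edge list plus a vertex table, with one round of "message passing" done as a join and a reduceByKey, which is the same primitive GraphX builds on. The data is made up and sc is assumed to be an existing SparkContext.

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

val edges: RDD[(Long, Long)]      = sc.parallelize(Seq((1L, 2L), (1L, 3L), (2L, 3L)))
val vertices: RDD[(Long, Double)] = sc.parallelize(Seq((1L, 1.0), (2L, 1.0), (3L, 1.0)))

// Send each vertex's value along its out-edges and sum what arrives at each destination.
val incoming = edges.join(vertices)                      // (src, (dst, srcValue))
  .map { case (_, (dst, srcValue)) => (dst, srcValue) }
  .reduceByKey(_ + _)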

On Wed, Nov 5, 2014 at 9:41 AM, Deep Pradhan 
wrote:

> Hi,
> Can Spark achieve whatever GraphX can?
> Keeping aside the performance comparison between Spark and GraphX, if I
> want to implement any graph algorithm and I do not want to use GraphX, can
> I get the work done with Spark?
>
> Than You
>


Re: Kafka Consumer in Spark Streaming

2014-11-04 Thread Something Something
Added foreach as follows.  Still don't see any output on my console.  Would
this go to the worker logs as Jerry indicated?

JavaPairReceiverInputDStream tweets = KafkaUtils.createStream(ssc,
":2181", "1", map);
JavaDStream statuses = tweets.map(
new Function() {
public String call(String status) {
return status;
}
}
);

statuses.foreach(new Function, Void>() {
@Override
public Void call(JavaRDD stringJavaRDD) throws
Exception {
for (String str: stringJavaRDD.take(10)) {
System.out.println("Message: " + str);
}
return null;
}
});


On Tue, Nov 4, 2014 at 10:32 PM, Shao, Saisai  wrote:

>  If you’re running on a standalone mode, the log is under
> /work/ directory. I’m not sure for yarn or mesos, you can check
> the document of Spark to see the details.
>
>
>
> Thanks
>
> Jerry
>
>
>
> *From:* Something Something [mailto:mailinglist...@gmail.com]
> *Sent:* Wednesday, November 05, 2014 2:28 PM
> *To:* Shao, Saisai
> *Cc:* user@spark.apache.org
> *Subject:* Re: Kafka Consumer in Spark Streaming
>
>
>
> The Kafka broker definitely has messages coming in.  But your #2 point is
> valid.  Needless to say I am a newbie to Spark.  I can't figure out where
> the 'executor' logs would be.  How would I find them?
>
> All I see printed on my screen is this:
>
> 14/11/04 22:21:23 INFO Slf4jLogger: Slf4jLogger started
> 14/11/04 22:21:23 INFO Remoting: Starting remoting
> 14/11/04 22:21:24 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://spark@mymachie:60743]
> 14/11/04 22:21:24 INFO Remoting: Remoting now listens on addresses:
> [akka.tcp://spark@mymachine:60743]
> 14/11/04 22:21:24 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 14/11/04 22:21:24 INFO JniBasedUnixGroupsMappingWithFallback: Falling back
> to shell based
> ---
> Time: 141516852 ms
> ---
> ---
> Time: 141516852 ms
> ---
>
> Keeps repeating this...
>
>
>
> On Tue, Nov 4, 2014 at 10:14 PM, Shao, Saisai 
> wrote:
>
>  Hi, would you mind describing your problem a little more specific.
>
>
>
> 1.  Is the Kafka broker currently has no data feed in?
>
> 2.  This code will print the lines, but not in the driver side, the
> code is running in the executor side, so you can check the log in worker
> dir to see if there’s any printing logs under this folder.
>
> 3.  Did you see any exceptions when running the app, this will help
> to define the problem.
>
>
>
> Thanks
>
> Jerry
>
>
>
> *From:* Something Something [mailto:mailinglist...@gmail.com]
> *Sent:* Wednesday, November 05, 2014 1:57 PM
> *To:* user@spark.apache.org
> *Subject:* Kafka Consumer in Spark Streaming
>
>
>
> I've following code in my program.  I don't get any error, but it's not
> consuming the messages either.  Shouldn't the following code print the line
> in the 'call' method?  What am I missing?
>
> Please help.  Thanks.
>
>
>
> JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new
> Duration(60 * 1 * 1000));
>
> JavaPairReceiverInputDStream tweets = KafkaUtils.createStream(ssc,
> ":2181", "1", map);
>
> JavaDStream statuses = tweets.map(
> new Function() {
> public String call(String status) {
> System.out.println(status);
> return status;
> }
> }
> );
>
>
>


How to increase hdfs read parallelism

2014-11-04 Thread Rajat Verma
Hi
I have simple use case where I have to join two feeds. I have two worker
nodes each having 96 GB memory and 24 cores. I am running spark(1.1.0) with
yarn(2.4.0).
I have allocated 80% resources to spark queue and my spark config looks like
spark.executor.cores=18
spark.executor.memory=66g
spark.executor.instances=2

My job schedules 400 tasks, and at any given time close to 40 tasks run in
parallel. The job is IO bound and CPU utilisation is less than 40%.
The Spark tuning page recommends configuring 2-3 tasks per CPU core. I have
changed spark.default.parallelism to 80, but it still runs only about 40
tasks at a time.
How can I run more tasks in parallel?
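
(One note: with spark.executor.cores=18 and two executors, at most about 36 tasks can run concurrently regardless of how many are scheduled, and spark.default.parallelism only affects the shuffle side. A hedged sketch of the usual read-side knobs follows; the paths and counts are placeholders, and parseA/parseB stand in for whatever turns a line into a (key, value) pair.)

// Ask for more input splits up front...
val feedA = sc.textFile("hdfs:///path/to/feedA", 200).map(parseA)   // minPartitions hint
val feedB = sc.textFile("hdfs:///path/to/feedB", 200).map(parseB)

// ...and make the shuffle parallelism explicit on the join itself.
val joined = feedA.join(feedB, 200)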

One more question: I have to cache the combined RDD after the join. Should I run 4
executors with 32GB of memory each and set -XX:+UseCompressedOops? What are the pros
and cons of doing that?

Thanks.


Re: save as JSON objects

2014-11-04 Thread Akhil Das
Something like this?

val json = myRDD.map(map_obj => new JSONObject(map_obj))

Here map_obj will be a map containing the values (e.g. Map("name" -> "Akhil",
"mail" -> "xyz@xyz"))

Performance wasn't so good with this one though.
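
A slightly more complete, self-contained sketch of the same idea, using Scala 2.10's built-in JSON helper (json4s or Jackson would typically be faster); the case class fields and output path here are made up:

import scala.util.parsing.json.JSONObject

case class Person(name: String, mail: String)

val people = sc.parallelize(Seq(Person("Akhil", "xyz@xyz"), Person("Andrejs", "abc@abc")))
val jsonLines = people.map(p => JSONObject(Map("name" -> p.name, "mail" -> p.mail)).toString())
jsonLines.saveAsTextFile("hdfs:///tmp/people-json")   // one JSON object per line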

Thanks
Best Regards

On Wed, Nov 5, 2014 at 3:02 AM, Yin Huai  wrote:

> Hello Andrejs,
>
> For now, you need to use a JSON lib to serialize records of your datasets
> as JSON strings. In future, we will add a method to SchemaRDD to let you
> write a SchemaRDD in JSON format (I have created
> https://issues.apache.org/jira/browse/SPARK-4228 to track it).
>
> Thanks,
>
> Yin
>
> On Tue, Nov 4, 2014 at 3:48 AM, Andrejs Abele <
> andrejs.ab...@insight-centre.org> wrote:
>
>> Hi,
>> Can some one pleas sugest me, what is the best way to output spark data
>> as JSON file. (File where each line is a JSON object)
>> Cheers,
>> Andrejs
>>
>
>


Re: stackoverflow error

2014-11-04 Thread Sean Owen
With so many iterations, your RDD lineage is too deep. You should not
need nearly so many iterations. 10 or 20 is usually plenty.
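
A hedged sketch of what that looks like for ALS (ratings is assumed to be an RDD[Rating]; the checkpoint directory is a placeholder, and how much ALS checkpoints internally varies by version, so the iteration count is the reliable lever):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // lets long lineages be truncated
val model = ALS.train(ratings, /* rank */ 10, /* iterations */ 10, /* lambda */ 0.01)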

On Tue, Nov 4, 2014 at 11:13 PM, Hongbin Liu  wrote:
> Hi, can you help with the following? We are new to spark.
>
>
>
> Error stack:
>
>
>
> 14/11/04 18:08:03 INFO SparkContext: Job finished: count at ALS.scala:314,
> took 480.318100288 s
>
> Exception in thread "main" java.lang.StackOverflowError

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Issue in Spark Streaming

2014-11-04 Thread Akhil Das
Which error are you referring to here? Can you paste the error logs?

Thanks
Best Regards

On Wed, Nov 5, 2014 at 11:04 AM, Suman S Patil 
wrote:

>  I am trying to run the Spark streaming program as given in the Spark
> streaming Programming guide
> ,
> in the interactive shell. I am getting an error as shown here as an
> intermediate step. It resumes the run on its own like this. Please let me
> know how to overcome this problem.
>
> Thanks in advance
>
>
>
>
>
>
>
> Regards,
>
> Suman S Patil
>
>
>
> --
> The contents of this e-mail and any attachment(s) may contain confidential
> or privileged information for the intended recipient(s). Unintended
> recipients are prohibited from taking action on the basis of information in
> this e-mail and using or disseminating the information, and must notify the
> sender and delete it from their system. L&T Infotech will not accept
> responsibility or liability for the accuracy or completeness of, or the
> presence of any virus or disabling code in this e-mail"
>


RE: Kafka Consumer in Spark Streaming

2014-11-04 Thread Shao, Saisai
If you’re running in standalone mode, the log is under the /work/
directory. I’m not sure about YARN or Mesos; you can check the Spark documentation
to see the details.

Thanks
Jerry

From: Something Something [mailto:mailinglist...@gmail.com]
Sent: Wednesday, November 05, 2014 2:28 PM
To: Shao, Saisai
Cc: user@spark.apache.org
Subject: Re: Kafka Consumer in Spark Streaming

The Kafka broker definitely has messages coming in.  But your #2 point is 
valid.  Needless to say I am a newbie to Spark.  I can't figure out where the 
'executor' logs would be.  How would I find them?
All I see printed on my screen is this:

14/11/04 22:21:23 INFO Slf4jLogger: Slf4jLogger started
14/11/04 22:21:23 INFO Remoting: Starting remoting
14/11/04 22:21:24 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://spark@mymachie:60743]
14/11/04 22:21:24 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://spark@mymachine:60743]
14/11/04 22:21:24 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
14/11/04 22:21:24 INFO JniBasedUnixGroupsMappingWithFallback: Falling back to 
shell based
---
Time: 141516852 ms
---
---
Time: 141516852 ms
---
Keeps repeating this...

On Tue, Nov 4, 2014 at 10:14 PM, Shao, Saisai 
mailto:saisai.s...@intel.com>> wrote:
Hi, would you mind describing your problem a little more specific.


1.  Is the Kafka broker currently has no data feed in?

2.  This code will print the lines, but not in the driver side, the code is 
running in the executor side, so you can check the log in worker dir to see if 
there’s any printing logs under this folder.

3.  Did you see any exceptions when running the app, this will help to 
define the problem.

Thanks
Jerry

From: Something Something 
[mailto:mailinglist...@gmail.com]
Sent: Wednesday, November 05, 2014 1:57 PM
To: user@spark.apache.org
Subject: Kafka Consumer in Spark Streaming

I've following code in my program.  I don't get any error, but it's not 
consuming the messages either.  Shouldn't the following code print the line in 
the 'call' method?  What am I missing?

Please help.  Thanks.



JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new 
Duration(60 * 1 * 1000));

JavaPairReceiverInputDStream tweets = KafkaUtils.createStream(ssc, 
":2181", "1", map);

JavaDStream statuses = tweets.map(
new Function() {
public String call(String status) {
System.out.println(status);
return status;
}
}
);



Re: ERROR UserGroupInformation: PriviledgedActionException

2014-11-04 Thread Akhil Das
It's more likely that you are running different versions of Spark.

Thanks
Best Regards
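
(For completeness, a minimal sketch of the driver-side settings discussed further down in this thread; the host and port values are placeholders, the master URL, app name, and jar path mirror the SparkConf already shown below.)

val sparkConf = new org.apache.spark.SparkConf()
  .setMaster("spark://myserver:7077")
  .setAppName("MyApp")
  .setJars(Array("target/my-app-1.0-SNAPSHOT.jar"))
  .set("spark.driver.host", "submitting-machine.example.com") // must be reachable from the cluster
  .set("spark.driver.port", "51000")                          // a port the cluster can connect back to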

On Wed, Nov 5, 2014 at 3:05 AM, Saiph Kappa  wrote:

> I set the host and port of the driver and now the error slightly changed
>
> Using Spark's default log4j profile:
>> org/apache/spark/log4j-defaults.properties
>> 14/11/04 21:13:48 INFO CoarseGrainedExecutorBackend: Registered signal
>> handlers for [TERM, HUP, INT]
>> 14/11/04 21:13:48 INFO SecurityManager: Changing view acls to:
>> myuser,Myuser
>> 14/11/04 21:13:48 INFO SecurityManager: Changing modify acls to:
>> myuser,Myuser
>> 14/11/04 21:13:48 INFO SecurityManager: SecurityManager: authentication
>> disabled; ui acls disabled; users with view permissions: Set(myuser,
>> Myuser); users with modify permissions: Set(myuser, Myuser)
>> 14/11/04 21:13:48 INFO Slf4jLogger: Slf4jLogger started
>> 14/11/04 21:13:48 INFO Remoting: Starting remoting
>> 14/11/04 21:13:49 INFO Remoting: Remoting started; listening on addresses
>> :[akka.tcp://driverPropsFetcher@myserver:37456]
>> 14/11/04 21:13:49 INFO Remoting: Remoting now listens on addresses:
>> [akka.tcp://driverPropsFetcher@myserver:37456]
>> 14/11/04 21:13:49 INFO Utils: Successfully started service
>> 'driverPropsFetcher' on port 37456.
>> 14/11/04 21:14:19 ERROR UserGroupInformation: PriviledgedActionException
>> as:Myuser cause:java.util.concurrent.TimeoutException: Futures timed out
>> after [30 seconds]
>> Exception in thread "main"
>> java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
>> at
>> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
>> at
>> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
>> at
>> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
>> at
>> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
>> Caused by: java.security.PrivilegedActionException:
>> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:415)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>> ... 4 more
>> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
>> [30 seconds]
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>> at
>> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>> at scala.concurrent.Await$.result(package.scala:107)
>> at
>> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:125)
>> at
>> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
>> at
>> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
>> ... 7 more
>>
>
> Any ideas?
>
> Thanks.
>
> On Tue, Nov 4, 2014 at 11:29 AM, Akhil Das 
> wrote:
>
>> If you want to run the spark application from a remote machine, then you
>> have to at least set the following configurations properly.
>>
>> *spark.driver.host* - points to the ip/host from where you are
>> submitting the job (make sure you are able to ping this from the cluster)
>>
>> *spark.driver.port* - set it to a port number which is accessible from
>> the spark cluster.
>>
>> You can look at more configuration options over here.
>> 
>>
>> Thanks
>> Best Regards
>>
>> On Tue, Nov 4, 2014 at 6:07 AM, Saiph Kappa 
>> wrote:
>>
>>> Hi,
>>>
>>> I am trying to submit a job to a spark cluster running on a single
>>> machine (1 master + 1 worker) with hadoop 1.0.4. I submit it in the code:
>>> «val sparkConf = new
>>> SparkConf().setMaster("spark://myserver:7077").setAppName("MyApp").setJars(Array("target/my-app-1.0-SNAPSHOT.jar"))».
>>>
>>> When I run this application on the same machine as the cluster
>>> everything works fine.
>>>
>>> But when I run it from a remote machine I get the following error:
>>>
>>> Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 14/11/04 00:15:38 INFO CoarseGrainedExecutorBackend: Registered signal
 handlers for [TERM, HUP, INT]
 14/11/04 00:15:38 INFO SecurityManager: Changing view acls to:
 myuser,Myuser
 14/11/04 00:15:38 INFO SecurityManager: Changing modify acls to:
 myuser,Myuser
 14/11/04 00:15:38 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(myuser,
 Myuser); users with modify 

RE: MEMORY_ONLY_SER question

2014-11-04 Thread Shao, Saisai
From my understanding, Spark deserializes Kryo-serialized RDD partitions in a
streaming manner: deserialization happens as the iteration moves forward. But whether
Kryo deserializes all the objects at once or incrementally is really a behavior of
Kryo itself, and I would guess Kryo does not deserialize the objects all at once.
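
A minimal sketch of the setup under discussion (the registrator class, the records RDD, and the process function are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyKryoRegistrator")   // placeholder registrator

// Persist serialized; objects are materialized as the iterator over each partition is consumed.
val cached = records.persist(StorageLevel.MEMORY_ONLY_SER)
cached.mapPartitions(iter => iter.map(process))                     // process is a placeholder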

Thanks
Jerry

From: Mohit Jaggi [mailto:mohitja...@gmail.com]
Sent: Wednesday, November 05, 2014 2:01 PM
To: Tathagata Das
Cc: user@spark.apache.org
Subject: Re: MEMORY_ONLY_SER question

I used the word "streaming" but I did not mean to refer to spark streaming. I 
meant if a partition containing 10 objects was kryo-serialized into a single 
buffer, then in a mapPartitions() call, as I call iter.next() 10 times to 
access these objects one at a time, does the deserialization happen
a) once to get all 10 objects,
b) 10 times "incrementally" to get an object at a time, or
c) 10 times to get 10 objects and discard the "wrong" 9 objects [ i doubt this 
would a design anyone would have adopted ]
I think your answer is option (a) and you refered to Spark streaming to 
indicate that there is no difference in its behavior from spark core...right?

If it is indeed option (a), I am happy with it and don't need to customize. If 
it is (b), I would like to have (a) instead.

I am also wondering if kryo is good at compression of strings and numbers. 
Often I have the data type as "Double" but it could be encoded in much fewer 
bits.



On Tue, Nov 4, 2014 at 1:02 PM, Tathagata Das 
mailto:tathagata.das1...@gmail.com>> wrote:
It it deserialized in a streaming manner as the iterator moves over the 
partition. This is a functionality of core Spark, and Spark Streaming just uses 
it as is.
What do you want to customize it to?

On Tue, Nov 4, 2014 at 9:22 AM, Mohit Jaggi 
mailto:mohitja...@gmail.com>> wrote:
Folks,
If I have an RDD persisted in MEMORY_ONLY_SER mode and then it is needed for a 
transformation/action later, is the whole partition of the RDD deserialized 
into Java objects first before my transform/action code works on it? Or is it 
deserialized in a streaming manner as the iterator moves over the partition? Is 
this behavior customizable? I generally use the Kryo serializer.

Mohit.




Re: Kafka Consumer in Spark Streaming

2014-11-04 Thread Sean Owen
this code only expresses a transformation and so does not actually
cause any action. I think you intend to use foreachRDD.
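
A hedged sketch of that fix in Scala (the ZooKeeper quorum, group id, topic map, and an existing StreamingContext ssc are assumed; the Java print()/foreachRDD calls are analogous). An output operation is what actually triggers execution, and take(10) brings a sample back to the driver so it shows up on the console you launched from:

import org.apache.spark.streaming.kafka.KafkaUtils

val lines = KafkaUtils.createStream(ssc, "zkhost:2181", "consumer-group-1", Map("mytopic" -> 1))
  .map(_._2)                                              // drop the Kafka message key
lines.foreachRDD { rdd => rdd.take(10).foreach(println) } // runs on the driver
ssc.start()
ssc.awaitTermination()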

On Wed, Nov 5, 2014 at 5:57 AM, Something Something
 wrote:
> I've following code in my program.  I don't get any error, but it's not
> consuming the messages either.  Shouldn't the following code print the line
> in the 'call' method?  What am I missing?
>
> Please help.  Thanks.
>
>
>
> JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new
> Duration(60 * 1 * 1000));
>
> JavaPairReceiverInputDStream tweets = KafkaUtils.createStream(ssc,
> ":2181", "1", map);
>
> JavaDStream statuses = tweets.map(
> new Function() {
> public String call(String status) {
> System.out.println(status);
> return status;
> }
> }
> );
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Kafka Consumer in Spark Streaming

2014-11-04 Thread Something Something
The Kafka broker definitely has messages coming in.  But your #2 point is
valid.  Needless to say I am a newbie to Spark.  I can't figure out where
the 'executor' logs would be.  How would I find them?

All I see printed on my screen is this:

14/11/04 22:21:23 INFO Slf4jLogger: Slf4jLogger started
14/11/04 22:21:23 INFO Remoting: Starting remoting
14/11/04 22:21:24 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://spark@mymachie:60743]
14/11/04 22:21:24 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://spark@mymachine:60743]
14/11/04 22:21:24 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
14/11/04 22:21:24 INFO JniBasedUnixGroupsMappingWithFallback: Falling back
to shell based
---
Time: 141516852 ms
---
---
Time: 141516852 ms
---
Keeps repeating this...

On Tue, Nov 4, 2014 at 10:14 PM, Shao, Saisai  wrote:

>  Hi, would you mind describing your problem a little more specific.
>
>
>
> 1.  Is the Kafka broker currently has no data feed in?
>
> 2.  This code will print the lines, but not in the driver side, the
> code is running in the executor side, so you can check the log in worker
> dir to see if there’s any printing logs under this folder.
>
> 3.  Did you see any exceptions when running the app, this will help
> to define the problem.
>
>
>
> Thanks
>
> Jerry
>
>
>
> *From:* Something Something [mailto:mailinglist...@gmail.com]
> *Sent:* Wednesday, November 05, 2014 1:57 PM
> *To:* user@spark.apache.org
> *Subject:* Kafka Consumer in Spark Streaming
>
>
>
> I've following code in my program.  I don't get any error, but it's not
> consuming the messages either.  Shouldn't the following code print the line
> in the 'call' method?  What am I missing?
>
> Please help.  Thanks.
>
>
>
> JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new
> Duration(60 * 1 * 1000));
>
> JavaPairReceiverInputDStream tweets = KafkaUtils.createStream(ssc,
> ":2181", "1", map);
>
> JavaDStream statuses = tweets.map(
> new Function() {
> public String call(String status) {
> System.out.println(status);
> return status;
> }
> }
> );
>


RE: Kafka Consumer in Spark Streaming

2014-11-04 Thread Shao, Saisai
Hi, would you mind describing your problem a little more specifically?


1.  Does the Kafka broker currently have any data feeding in?

2.  This code will print the lines, but not on the driver side; the code is
running on the executor side, so you can check the log in the worker dir to see if
there are any printed logs under that folder.

3.  Did you see any exceptions when running the app? That will help to
pin down the problem.

Thanks
Jerry

From: Something Something [mailto:mailinglist...@gmail.com]
Sent: Wednesday, November 05, 2014 1:57 PM
To: user@spark.apache.org
Subject: Kafka Consumer in Spark Streaming

I've following code in my program.  I don't get any error, but it's not 
consuming the messages either.  Shouldn't the following code print the line in 
the 'call' method?  What am I missing?

Please help.  Thanks.



JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new 
Duration(60 * 1 * 1000));

JavaPairReceiverInputDStream tweets = KafkaUtils.createStream(ssc, 
":2181", "1", map);

JavaDStream statuses = tweets.map(
new Function() {
public String call(String status) {
System.out.println(status);
return status;
}
}
);


Re: MEMORY_ONLY_SER question

2014-11-04 Thread Mohit Jaggi
I used the word "streaming" but I did not mean to refer to spark streaming.
I meant if a partition containing 10 objects was kryo-serialized into a
single buffer, then in a mapPartitions() call, as I call iter.next() 10
times to access these objects one at a time, does the deserialization happen
a) once to get all 10 objects,
b) 10 times "incrementally" to get an object at a time, or
c) 10 times to get 10 objects, discarding the "wrong" 9 objects each time [I doubt
this is a design anyone would have adopted]
I think your answer is option (a), and you referred to Spark Streaming to
indicate that there is no difference in its behavior from Spark
core...right?

If it is indeed option (a), I am happy with it and don't need to customize.
If it is (b), I would like to have (a) instead.

I am also wondering if kryo is good at compression of strings and numbers.
Often I have the data type as "Double" but it could be encoded in much
fewer bits.



On Tue, Nov 4, 2014 at 1:02 PM, Tathagata Das 
wrote:

> It it deserialized in a streaming manner as the iterator moves over the
> partition. This is a functionality of core Spark, and Spark Streaming just
> uses it as is.
> What do you want to customize it to?
>
> On Tue, Nov 4, 2014 at 9:22 AM, Mohit Jaggi  wrote:
>
>> Folks,
>> If I have an RDD persisted in MEMORY_ONLY_SER mode and then it is needed
>> for a transformation/action later, is the whole partition of the RDD
>> deserialized into Java objects first before my transform/action code works
>> on it? Or is it deserialized in a streaming manner as the iterator moves
>> over the partition? Is this behavior customizable? I generally use the Kryo
>> serializer.
>>
>> Mohit.
>>
>
>


Kafka Consumer in Spark Streaming

2014-11-04 Thread Something Something
I've following code in my program.  I don't get any error, but it's not
consuming the messages either.  Shouldn't the following code print the line
in the 'call' method?  What am I missing?

Please help.  Thanks.



JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new
Duration(60 * 1 * 1000));

JavaPairReceiverInputDStream tweets = KafkaUtils.createStream(ssc,
":2181", "1", map);

JavaDStream statuses = tweets.map(
new Function() {
public String call(String status) {
System.out.println(status);
return status;
}
}
);


Issue in Spark Streaming

2014-11-04 Thread Suman S Patil
I am trying to run the Spark streaming program as given in the Spark streaming 
Programming 
guide, 
in the interactive shell. I am getting an error as shown 
here as an intermediate 
step. It resumes the run on its own like 
this. Please let me know 
how to overcome this problem.
Thanks in advance



Regards,
Suman S Patil



The contents of this e-mail and any attachment(s) may contain confidential or 
privileged information for the intended recipient(s). Unintended recipients are 
prohibited from taking action on the basis of information in this e-mail and 
using or disseminating the information, and must notify the sender and delete 
it from their system. L&T Infotech will not accept responsibility or liability 
for the accuracy or completeness of, or the presence of any virus or disabling 
code in this e-mail"


Re: Spark Streaming getOrCreate

2014-11-04 Thread sivarani
Anybody have any luck? I am also trying to return None to delete a key from state. Will
null help? How do I use Scala's None in Java?

My code goes this way 

public static class ScalaLang {

    public static <T> Option<T> none() {
        return (Option<T>) None$.MODULE$;
    }
}

Function2<List<Double>, Optional<Double>, Optional<Double>> updateFunction =
    new Function2<List<Double>, Optional<Double>, Optional<Double>>() {
        @Override public Optional<Double> call(List<Double> values,
                                               Optional<Double> state) {
            Double newSum = state.or(0D);
            if (values.isEmpty()) {
                System.out.println("empty value");
                return null;  // I WANT TO RETURN NONE TO DELETE THE KEY, but when I
                              // set ScalaLang.<>none(); it shows an error
            } else {
                for (double i : values) {
                    newSum += i;
                }
                return Optional.of(newSum);
            }
        }
    };



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-getOrCreate-tp18060p18139.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



MLlib and PredictionIO sample code

2014-11-04 Thread Simon Chan
Hey guys,

I have written a tutorial on deploying MLlib's models on production with
open source PredictionIO: http://docs.prediction.io/0.8.1/templates/

The goal is to add the following features to MLlib, with production
application in mind:
- JSON query to retrieve prediction online
- Separation-of-concern software pattern based on the DASE architecture
- Support model update with new data

This first draft includes examples for Collaborative Filtering and
Classification only. I would love to hear your feedback!


Regards,
Simon


Re: loading, querying schemaRDD using SparkSQL

2014-11-04 Thread vdiwakar.malladi
Thanks Michael for your response.

Just now, I saw the saveAsTable method on the JavaSchemaRDD object (in the Spark 1.1.0
API), but I couldn't find the corresponding documentation. Will that help?
Please let me know.

Thanks in advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/loading-querying-schemaRDD-using-SparkSQL-tp18052p18137.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



GraphX and Spark

2014-11-04 Thread Deep Pradhan
Hi,
Can Spark achieve whatever GraphX can?
Keeping aside the performance comparison between Spark and GraphX, if I
want to implement any graph algorithm and I do not want to use GraphX, can
I get the work done with Spark?

Than You


Re: pass unique ID to mllib algorithms pyspark

2014-11-04 Thread Xiangrui Meng
The proposed new set of APIs (SPARK-3573, SPARK-3530) will address
this issue. We "carry over" extra columns with training and prediction
and then leverage on Spark SQL's execution plan optimization to decide
which columns are really needed. For the current set of APIs, we can
add `predictOnValues` to models, which carries over the input keys.
StreamingKMeans and StreamingLinearRegression implement this method.
-Xiangrui

On Tue, Nov 4, 2014 at 2:30 AM, jamborta  wrote:
> Hi all,
>
> There are a few algorithms in pyspark where the prediction part is
> implemented in scala (e.g. ALS, decision trees) where it is not very easy to
> manipulate the prediction methods.
>
> I think it is a very common scenario that the user would like to generate
> prediction for a datasets, so that each predicted value is identifiable
> (e.g. have a unique id attached to it). this is not possible in the current
> implementation as predict functions take a feature vector and return the
> predicted values where, I believe, the order is not guaranteed, so there is
> no way to join it back with the original data the predictions are generated
> from.
>
> Is there a way around this at the moment?
>
> thanks,
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/pass-unique-ID-to-mllib-algorithms-pyspark-tp18051.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: stdout in spark applications

2014-11-04 Thread lokeshkumar
Got my answer from this thread,
http://apache-spark-user-list.1001560.n3.nabble.com/no-stdout-output-from-worker-td2437.html



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/stdout-in-spark-applications-tp18056p18134.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark_ec2.py for AWS region: cn-north-1, China

2014-11-04 Thread Nicholas Chammas
Oh, I can see that region via boto as well. Perhaps the doc is indeed out
of date.

Do you mind opening a JIRA issue
 to track this
request? I can do it if you've never opened a JIRA issue before.

Nick

On Tue, Nov 4, 2014 at 9:03 PM, haitao .yao  wrote:

> I'm afraid not. We have been using EC2 instances in cn-north-1 region for
> a while. And the latest version of boto has added the region: cn-north-1
> Here's the  screenshot:
>  from  boto import ec2
> >>> ec2.regions()
> [RegionInfo:us-east-1, RegionInfo:cn-north-1, RegionInfo:ap-northeast-1,
> RegionInfo:eu-west-1, RegionInfo:ap-southeast-1, RegionInfo:ap-southeast-2,
> RegionInfo:us-west-2, RegionInfo:us-gov-west-1, RegionInfo:us-west-1,
> RegionInfo:eu-central-1, RegionInfo:sa-east-1]
> >>>
>
> I do think the doc is out of date.
>
>
>
> 2014-11-05 9:45 GMT+08:00 Nicholas Chammas :
>
>>
>> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
>>
>> cn-north-1 is not a supported region for EC2, as far as I can tell. There
>> may be other AWS services that can use that region, but spark-ec2 relies on
>> EC2.
>>
>> Nick
>>
>> On Tue, Nov 4, 2014 at 8:09 PM, haitao .yao  wrote:
>>
>>> Hi,
>>>Amazon aws started to provide service for China mainland, the region
>>> name is cn-north-1. But the script spark provides: spark_ec2.py will query
>>> ami id from https://github.com/mesos/spark-ec2/tree/v4/ami-list and
>>> there's no ami information for cn-north-1 region .
>>>Can anybody update the ami information and update the repo:
>>> https://github.com/mesos/spark-ec2.git ?
>>>
>>>Thanks.
>>>
>>> --
>>> haitao.yao
>>>
>>>
>>>
>>>
>>
>
>
> --
> haitao.yao
>
>
>
>


Re: Spark v Redshift

2014-11-04 Thread Vladimir Rodionov
>> We service templated queries from the appserver, i.e. user fills
>>out some forms, dropdowns: we translate to a query.

and

>>The target data
>>size is about a billion records, 20'ish fields, distributed throughout a
>>year (about 50GB on disk as CSV, uncompressed).

tells me that a proprietary in-memory app will be the best option for you.

I do not see any need for either Spark or Redshift in your case.



On Tue, Nov 4, 2014 at 5:41 PM, agfung  wrote:

> Sounds like context would help, I just didn't want to subject people to a
> wall of text if it wasn't necessary :)
>
> Currently we use neither Spark SQL (or anything in the Hadoop stack) or
> Redshift.  We service templated queries from the appserver, i.e. user fills
> out some forms, dropdowns: we translate to a query.
>
> Data is "basically" one table containing thousands of independent time
> series, with one or two tables of reference data to join to.  e.g. median
> value of Field1 from Table1 where Field2 from Table 2 matches X filter, T1
> and T2 joining on a surrogate key, group by a different Field3.  The data
> structure is a little bit dynamic.  User can upload any CSV, as long as
> they
> tell us the name of each column and the programmatic type.  The target data
> size is about a billion records, 20'ish fields, distributed throughout a
> year (about 50GB on disk as CSV, uncompressed).
>
> So we're currently doing "historical" analytics (e.g. see analytic results
> of only yesterday's data or older, but want to see the result "quickly").
> We eventually intend to do "realtime" (or "streaming") analytics (i.e. see
> the impact of new data on analytics "quickly").  Machine learning is also
> on
> the roadmap.
>
> One proposition is for Spark SQL as a complete replacement for Redshift.
> It
> would simplify the architecture, since our long term strategy is to handle
> data intake and ETL on HDFS (regardless of Redshift or Spark SQL).  The
> other parts of the Hadoop family that would come into play for ETL is
> undetermined right now.  Spark SQL appears to have relational ability, and
> if we're going to use the Hadoop stack for ML and streaming analytics, and
> it has the ability, why not do it all on one stack and not shovel data
> around?  Also, lots of people talking about it.
>
> The other proposition is Redshift as the historical analytics solution, and
> something else (could be Spark, doesn't matter) for streaming analytics and
> ML.   If we need to relate the two, we'll have an API or process to stitch
> it together.   I've read about the "lambda architecture", which more or
> less
> describes this approach.  The motivation is Redshift has the AWS
> reliability/scalability/operational concerns worked out, richer query
> language (SQL and pgsql functions are designed for slice-n-dice analytics)
> so we can spend our coding time elsewhere, and a measure of safety against
> design issues and bugs: Spark just came out of incubator status this year,
> and it's much easier to find people on the web raving positively about
> Redshift in real-world usage (i.e. part of live, client-facing system) than
> Spark.
>
> category_theory's observation that most of the speed comes from fitting in
> memory is helpful.  It's what I would have surmised from the AMPLab Big
> Data
> benchmark, but confirmation from the hands-on community is invaluable,
> thank
> you.
>
> I understand a lot of it simply has to do with what-do-you-value-more
> weightings, and we'll do prototypes/benchmarks if we have to, just wasn't
> sure if there were any other "key assumptions/requirements/gotchas" to
> consider.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-v-Redshift-tp18112p18127.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: spark_ec2.py for AWS region: cn-north-1, China

2014-11-04 Thread haitao .yao
I'm afraid not. We have been using EC2 instances in cn-north-1 region for a
while. And the latest version of boto has added the region: cn-north-1
Here's the  screenshot:
 from  boto import ec2
>>> ec2.regions()
[RegionInfo:us-east-1, RegionInfo:cn-north-1, RegionInfo:ap-northeast-1,
RegionInfo:eu-west-1, RegionInfo:ap-southeast-1, RegionInfo:ap-southeast-2,
RegionInfo:us-west-2, RegionInfo:us-gov-west-1, RegionInfo:us-west-1,
RegionInfo:eu-central-1, RegionInfo:sa-east-1]
>>>

I do think the doc is out of date.



2014-11-05 9:45 GMT+08:00 Nicholas Chammas :

>
> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
>
> cn-north-1 is not a supported region for EC2, as far as I can tell. There
> may be other AWS services that can use that region, but spark-ec2 relies on
> EC2.
>
> Nick
>
> On Tue, Nov 4, 2014 at 8:09 PM, haitao .yao  wrote:
>
>> Hi,
>>Amazon aws started to provide service for China mainland, the region
>> name is cn-north-1. But the script spark provides: spark_ec2.py will query
>> ami id from https://github.com/mesos/spark-ec2/tree/v4/ami-list and
>> there's no ami information for cn-north-1 region .
>>Can anybody update the ami information and update the repo:
>> https://github.com/mesos/spark-ec2.git ?
>>
>>Thanks.
>>
>> --
>> haitao.yao
>>
>>
>>
>>
>


-- 
haitao.yao


Re: Why mapred for the HadoopRDD?

2014-11-04 Thread raymond
You could take a look at sc.newAPIHadoopRDD()
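
A hedged sketch of the "new" (org.apache.hadoop.mapreduce) API entry point for a plain text file; the path is a placeholder, and sc.newAPIHadoopRDD() is the variant that takes a pre-built Configuration instead of a path:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val lines = sc.newAPIHadoopFile(
    "hdfs:///path/to/input",
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text])
  .map { case (_, text) => text.toString }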


On Nov 5, 2014, at 9:29 AM, Corey Nolet  wrote:

> I'm fairly new to spark and I'm trying to kick the tires with a few 
> InputFormats. I noticed the sc.hadoopRDD() method takes a mapred JobConf 
> instead of a MapReduce Job object. Is there future planned support for the 
> mapreduce packaging?
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark_ec2.py for AWS region: cn-north-1, China

2014-11-04 Thread Nicholas Chammas
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html

cn-north-1 is not a supported region for EC2, as far as I can tell. There
may be other AWS services that can use that region, but spark-ec2 relies on
EC2.

Nick

On Tue, Nov 4, 2014 at 8:09 PM, haitao .yao  wrote:

> Hi,
>Amazon aws started to provide service for China mainland, the region
> name is cn-north-1. But the script spark provides: spark_ec2.py will query
> ami id from https://github.com/mesos/spark-ec2/tree/v4/ami-list and
> there's no ami information for cn-north-1 region .
>Can anybody update the ami information and update the reo:
> https://github.com/mesos/spark-ec2.git ?
>
>Thanks.
>
> --
> haitao.yao
>
>
>
>


Re: Using SQL statements vs. SchemaRDD methods

2014-11-04 Thread Michael Armbrust
They both compile down to the same logical plans so the performance of
running the query should be the same.  The Scala DSL uses a lot of Scala
magic and thus is experimental, whereas HiveQL is pretty set in stone.
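
For a concrete illustration of the two styles expressing the same query -- a
sketch only, with made-up names and data, assuming the Spark 1.1-era API where
`import sqlContext._` brings in the implicit SchemaRDD conversion and the DSL
symbols:

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext._   // implicit RDD-to-SchemaRDD conversion and DSL symbols

val people = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 17)))
people.registerTempTable("people")

// SQL statement
val adultsSql = sql("SELECT name FROM people WHERE age > 21")

// Equivalent SchemaRDD DSL; both compile to the same logical plan
val adultsDsl = people.where('age > 21).select('name)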

On Tue, Nov 4, 2014 at 5:22 PM, SK  wrote:

> SchemaRDD  supports some of the SQL-like functionality like groupBy(),
> distinct(), select(). However, SparkSQL also supports SQL statements which
> provide this functionality. In terms of future support and performance, is
> it better to use SQL statements or the SchemaRDD methods that provide
> equivalent functionality?
>
> thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Using-SQL-statements-vs-SchemaRDD-methods-tp18124.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark v Redshift

2014-11-04 Thread agfung
Sounds like context would help, I just didn't want to subject people to a
wall of text if it wasn't necessary :)

Currently we use neither Spark SQL (or anything in the Hadoop stack) or
Redshift.  We service templated queries from the appserver, i.e. user fills
out some forms, dropdowns: we translate to a query.

Data is "basically" one table containing thousands of independent time
series, with one or two tables of reference data to join to.  e.g. median
value of Field1 from Table1 where Field2 from Table 2 matches X filter, T1
and T2 joining on a surrogate key, group by a different Field3.  The data
structure is a little bit dynamic.  User can upload any CSV, as long as they
tell us the name of each column and the programmatic type.  The target data
size is about a billion records, 20'ish fields, distributed throughout a
year (about 50GB on disk as CSV, uncompressed).

So we're currently doing "historical" analytics (e.g. see analytic results
of only yesterday's data or older, but want to see the result "quickly").  
We eventually intend to do "realtime" (or "streaming") analytics (i.e. see
the impact of new data on analytics "quickly").  Machine learning is also on
the roadmap.

One proposition is for Spark SQL as a complete replacement for Redshift.  It
would simplify the architecture, since our long term strategy is to handle
data intake and ETL on HDFS (regardless of Redshift or Spark SQL).  The
other parts of the Hadoop family that would come into play for ETL is
undetermined right now.  Spark SQL appears to have relational ability, and
if we're going to use the Hadoop stack for ML and streaming analytics, and
it has the ability, why not do it all on one stack and not shovel data
around?  Also, lots of people talking about it.

The other proposition is Redshift as the historical analytics solution, and
something else (could be Spark, doesn't matter) for streaming analytics and
ML.   If we need to relate the two, we'll have an API or process to stitch
it together.   I've read about the "lambda architecture", which more or less
describes this approach.  The motivation is Redshift has the AWS
reliability/scalability/operational concerns worked out, richer query
language (SQL and pgsql functions are designed for slice-n-dice analytics)
so we can spend our coding time elsewhere, and a measure of safety against
design issues and bugs: Spark just came out of incubator status this year,
and it's much easier to find people on the web raving positively about
Redshift in real-world usage (i.e. part of live, client-facing system) than
Spark.

category_theory's observation that most of the speed comes from fitting in
memory is helpful.  It's what I would have surmised from the AMPLab Big Data
benchmark, but confirmation from the hands-on community is invaluable, thank
you.

I understand a lot of it simply has to do with what-do-you-value-more
weightings, and we'll do prototypes/benchmarks if we have to, just wasn't
sure if there were any other "key assumptions/requirements/gotchas" to
consider.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-v-Redshift-tp18112p18127.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Why mapred for the HadoopRDD?

2014-11-04 Thread Corey Nolet
I'm fairly new to spark and I'm trying to kick the tires with a few
InputFormats. I noticed the sc.hadoopRDD() method takes a mapred JobConf
instead of a MapReduce Job object. Is there future planned support for the
mapreduce packaging?


Re: netty on classpath when using spark-submit

2014-11-04 Thread Tobias Pfeiffer
Markus,

thanks for your help!

On Tue, Nov 4, 2014 at 8:33 PM, M. Dale  wrote:

>  Tobias,
>From http://spark.apache.org/docs/latest/configuration.html it seems
> that there is an experimental property:
>
> spark.files.userClassPathFirst
>

Thank you very much, I didn't know about this.  Unfortunately, it doesn't
change anything.  With this setting both true and false (as indicated by
the Spark web interface) and no matter whether "local[N]" or "yarn-client"
or "yarn-cluster" mode are used with spark-submit, the classpath looks the
same and the netty class is loaded from the Spark jar. Can I use this
setting with spark-submit at all?
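
For reference only (not a confirmed fix), the experimental property quoted above
can also be set programmatically on the SparkConf; whether it takes effect for
the netty classes here is exactly the open question. A sketch, with the app name
as a placeholder:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.files.userClassPathFirst", "true")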

Thanks
Tobias


Using SQL statements vs. SchemaRDD methods

2014-11-04 Thread SK
SchemaRDD  supports some of the SQL-like functionality like groupBy(),
distinct(), select(). However, SparkSQL also supports SQL statements which
provide this functionality. In terms of future support and performance, is
it better to use SQL statements or the SchemaRDD methods that provide
equivalent functionality? 

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-SQL-statements-vs-SchemaRDD-methods-tp18124.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark_ec2.py for AWS region: cn-north-1, China

2014-11-04 Thread haitao .yao
Hi,
   Amazon AWS has started to provide service for mainland China; the region
name is cn-north-1. But the script Spark provides, spark_ec2.py, will query the
AMI id from https://github.com/mesos/spark-ec2/tree/v4/ami-list, and there's no
AMI information for the cn-north-1 region.
   Can anybody update the AMI information and update the repo:
https://github.com/mesos/spark-ec2.git ?

   Thanks.

-- 
haitao.yao


Re: deploying a model built in mllib

2014-11-04 Thread Simon Chan
The latest version of PredictionIO, which is now under Apache 2 license,
supports the deployment of MLlib models on production.

The "engine" you build will including a few components, such as:
- Data - includes Data Source and Data Preparator
- Algorithm(s)
- Serving
I believe that you can do the feature vector creation inside the Data
Preparator component.

Currently, the package comes with two templates: 1)  Collaborative
Filtering Engine Template - with MLlib ALS; 2) Classification Engine
Template - with MLlib Naive Bayes. The latter one may be useful to you. And
you can customize the Algorithm component, too.

I have just created a doc: http://docs.prediction.io/0.8.1/templates/
Love to hear your feedback!

Regards,
Simon



On Mon, Oct 27, 2014 at 11:03 AM, chirag lakhani 
wrote:

> Would pipelining include model export?  I didn't see that in the
> documentation.
>
> Are there ways that this is being done currently?
>
>
>
> On Mon, Oct 27, 2014 at 12:39 PM, Xiangrui Meng  wrote:
>
>> We are working on the pipeline features, which would make this
>> procedure much easier in MLlib. This is still a WIP and the main JIRA
>> is at:
>>
>> https://issues.apache.org/jira/browse/SPARK-1856
>>
>> Best,
>> Xiangrui
>>
>> On Mon, Oct 27, 2014 at 8:56 AM, chirag lakhani
>>  wrote:
>> > Hello,
>> >
>> > I have been prototyping a text classification model that my company
>> would
>> > like to eventually put into production.  Our technology stack is
>> currently
>> > Java based but we would like to be able to build our models in
>> Spark/MLlib
>> > and then export something like a PMML file which can be used for model
>> > scoring in real-time.
>> >
>> > I have been using scikit learn where I am able to take the training data
>> > convert the text data into a sparse data format and then take the other
>> > features and use the dictionary vectorizer to do one-hot encoding for
>> the
>> > other categorical variables.  All of those things seem to be possible in
>> > mllib but I am still puzzled about how that can be packaged in such a
>> way
>> > that the incoming data can be first made into feature vectors and then
>> > evaluated as well.
>> >
>> > Are there any best practices for this type of thing in Spark?  I hope
>> this
>> > is clear but if there are any confusions then please let me know.
>> >
>> > Thanks,
>> >
>> > Chirag
>>
>
>


RE: Workers not registering after master restart

2014-11-04 Thread Ashic Mahtab
Hi Nan,
Cool. Thanks.
Regards,
Ashic.
Date: Tue, 4 Nov 2014 18:26:48 -0500
From: zhunanmcg...@gmail.com
To: as...@live.com
CC: user@spark.apache.org
Subject: Re: Workers not registering after master restart



Hi, Ashic, 
this is expected for the latest released version

However, workers should be able to re-register since 1.2, since this patch 
https://github.com/apache/spark/pull/2828 was merged
Best,

-- Nan Zhu

 
On Tuesday, November 4, 2014 at 6:00 PM, Ashic Mahtab wrote:




Hi,
I've set up a standalone Spark master (no failover or file recovery specified),
and brought up a few worker nodes. All of them registered and were shown in the
master web UI. I then stopped and started the master service (the workers were
still running). After the master started up, I checked the web UI and none of
the workers were registered. I then stopped and started each worker and they
registered with the master again.

My question is - is this expected? Is there a timeout after which the worker
would have rejoined the master? Or is the only way to ensure workers rejoin is
to run master failover or file based recovery for the master?

Thanks,
Ashic.

 
 
 
 

 



  

Re: Spark v Redshift

2014-11-04 Thread Akshar Dave
There is no one-size-fits-all solution available in the market today. If
somebody tells you they have one, they are simply lying :)

Both solutions cater to different sets of problems. My recommendation is to
put real focus on getting a better understanding of the problems that you
are trying to solve with Spark and Redshift, and pick a tool based on how
effectively it handles those problems. Like Matei said, both might be
relevant in some cases.

Thanks
Akshar


On Tue, Nov 4, 2014 at 4:00 PM, Jimmy McErlain  wrote:

> This is pretty spot on.. though I would also add that the Spark features
> that it touts around speed are all dependent on caching the data into
> memory... reading off the disk still takes time..ie pulling the data into
> an RDD.  This is the reason that Spark is great for ML... the data is used
> over and over again to fit models so its pulled into memory once then
> basically analyzed through the algos... other DBs systems are reading and
> writing to disk repeatedly and are thus slower, such as mahout (though its
> getting ported over to Spark as well to compete with MLlib)...
>
> J
>
>
>
>
> *JIMMY MCERLAIN*
>
> DATA SCIENTIST (NERD)
>
> *. . . . . . . . . . . . . . . . . .*
>
>
> *IF WE CAN’T DOUBLE YOUR SALES,*
>
>
>
> *ONE OF US IS IN THE WRONG BUSINESS.*
>
> *E*: ji...@sellpoints.com
>
> *M*: *510.303.7751 <510.303.7751>*
>
> On Tue, Nov 4, 2014 at 3:51 PM, Matei Zaharia 
> wrote:
>
>> Is this about Spark SQL vs Redshift, or Spark in general? Spark in
>> general provides a broader set of capabilities than Redshift because it has
>> APIs in general-purpose languages (Java, Scala, Python) and libraries for
>> things like machine learning and graph processing. For example, you might
>> use Spark to do the ETL that will put data into a database such as
>> Redshift, or you might pull data out of Redshift into Spark for machine
>> learning. On the other hand, if *all* you want to do is SQL and you are
>> okay with the set of data formats and features in Redshift (i.e. you can
>> express everything using its UDFs and you have a way to get data in), then
>> Redshift is a complete service which will do more management out of the box.
>>
>> Matei
>>
>> > On Nov 4, 2014, at 3:11 PM, agfung  wrote:
>> >
>> > I'm in the midst of a heated debate about the use of Redshift v Spark
>> with a
>> > colleague.  We keep trading anecdotes and links back and forth (eg
>> airbnb
>> > post from 2013 or amplab benchmarks), and we don't seem to be getting
>> > anywhere.
>> >
>> > So before we start down the prototype /benchmark road, and in
>> desperation
>> > of finding *some* kind of objective third party perspective,  was
>> wondering
>> > if anyone who has used both in 2014 would care to provide commentary
>> about
>> > the sweet spot use cases / gotchas for non trivial use (eg a simple
>> filter
>> > scan isn't really interesting).  Soft issues like operational
>> maintenance
>> > and time spent developing v out of the box are interesting too...
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-v-Redshift-tp18112.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


-- 
Akshar Dave
Principal – Big Data
SoftNet Solutions
Office: 408.542.0888 | Mobile: 408.896.1486
940 Hamlin Court, Sunnyvale, CA 94089
www.softnets.com/bigdata


Re: Spark v Redshift

2014-11-04 Thread Jimmy McErlain
This is pretty spot on... though I would also add that the Spark features it
touts around speed all depend on caching the data in memory... reading off the
disk still takes time, i.e. pulling the data into an RDD.  This is the reason
Spark is great for ML... the data is used over and over again to fit models, so
it's pulled into memory once and then basically analyzed through the algos...
other DB systems are reading and writing to disk repeatedly and are thus
slower, such as Mahout (though it's getting ported over to Spark as well to
compete with MLlib)...

J




*JIMMY MCERLAIN*

DATA SCIENTIST (NERD)

*. . . . . . . . . . . . . . . . . .*


*IF WE CAN’T DOUBLE YOUR SALES,*



*ONE OF US IS IN THE WRONG BUSINESS.*

*E*: ji...@sellpoints.com

*M*: *510.303.7751*

On Tue, Nov 4, 2014 at 3:51 PM, Matei Zaharia 
wrote:

> Is this about Spark SQL vs Redshift, or Spark in general? Spark in general
> provides a broader set of capabilities than Redshift because it has APIs in
> general-purpose languages (Java, Scala, Python) and libraries for things
> like machine learning and graph processing. For example, you might use
> Spark to do the ETL that will put data into a database such as Redshift, or
> you might pull data out of Redshift into Spark for machine learning. On the
> other hand, if *all* you want to do is SQL and you are okay with the set of
> data formats and features in Redshift (i.e. you can express everything
> using its UDFs and you have a way to get data in), then Redshift is a
> complete service which will do more management out of the box.
>
> Matei
>
> > On Nov 4, 2014, at 3:11 PM, agfung  wrote:
> >
> > I'm in the midst of a heated debate about the use of Redshift v Spark
> with a
> > colleague.  We keep trading anecdotes and links back and forth (eg airbnb
> > post from 2013 or amplab benchmarks), and we don't seem to be getting
> > anywhere.
> >
> > So before we start down the prototype /benchmark road, and in desperation
> > of finding *some* kind of objective third party perspective,  was
> wondering
> > if anyone who has used both in 2014 would care to provide commentary
> about
> > the sweet spot use cases / gotchas for non trivial use (eg a simple
> filter
> > scan isn't really interesting).  Soft issues like operational maintenance
> > and time spent developing v out of the box are interesting too...
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-v-Redshift-tp18112.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark v Redshift

2014-11-04 Thread Matei Zaharia
BTW while I haven't actually used Redshift, I've seen many companies that use 
both, usually using Spark for ETL and advanced analytics and Redshift for SQL 
on the cleaned / summarized data. Xiangrui Meng also wrote 
https://github.com/mengxr/redshift-input-format to make it easy to read data 
exported from Redshift into Spark or Hadoop.

Matei

> On Nov 4, 2014, at 3:51 PM, Matei Zaharia  wrote:
> 
> Is this about Spark SQL vs Redshift, or Spark in general? Spark in general 
> provides a broader set of capabilities than Redshift because it has APIs in 
> general-purpose languages (Java, Scala, Python) and libraries for things like 
> machine learning and graph processing. For example, you might use Spark to do 
> the ETL that will put data into a database such as Redshift, or you might 
> pull data out of Redshift into Spark for machine learning. On the other hand, 
> if *all* you want to do is SQL and you are okay with the set of data formats 
> and features in Redshift (i.e. you can express everything using its UDFs and 
> you have a way to get data in), then Redshift is a complete service which 
> will do more management out of the box.
> 
> Matei
> 
>> On Nov 4, 2014, at 3:11 PM, agfung  wrote:
>> 
>> I'm in the midst of a heated debate about the use of Redshift v Spark with a
>> colleague.  We keep trading anecdotes and links back and forth (eg airbnb
>> post from 2013 or amplab benchmarks), and we don't seem to be getting
>> anywhere. 
>> 
>> So before we start down the prototype /benchmark road, and in desperation 
>> of finding *some* kind of objective third party perspective,  was wondering
>> if anyone who has used both in 2014 would care to provide commentary about
>> the sweet spot use cases / gotchas for non trivial use (eg a simple filter
>> scan isn't really interesting).  Soft issues like operational maintenance
>> and time spent developing v out of the box are interesting too... 
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-v-Redshift-tp18112.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>> 
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark v Redshift

2014-11-04 Thread Matei Zaharia
Is this about Spark SQL vs Redshift, or Spark in general? Spark in general 
provides a broader set of capabilities than Redshift because it has APIs in 
general-purpose languages (Java, Scala, Python) and libraries for things like 
machine learning and graph processing. For example, you might use Spark to do 
the ETL that will put data into a database such as Redshift, or you might pull 
data out of Redshift into Spark for machine learning. On the other hand, if 
*all* you want to do is SQL and you are okay with the set of data formats and 
features in Redshift (i.e. you can express everything using its UDFs and you 
have a way to get data in), then Redshift is a complete service which will do 
more management out of the box.

Matei

> On Nov 4, 2014, at 3:11 PM, agfung  wrote:
> 
> I'm in the midst of a heated debate about the use of Redshift v Spark with a
> colleague.  We keep trading anecdotes and links back and forth (eg airbnb
> post from 2013 or amplab benchmarks), and we don't seem to be getting
> anywhere. 
> 
> So before we start down the prototype /benchmark road, and in desperation 
> of finding *some* kind of objective third party perspective,  was wondering
> if anyone who has used both in 2014 would care to provide commentary about
> the sweet spot use cases / gotchas for non trivial use (eg a simple filter
> scan isn't really interesting).  Soft issues like operational maintenance
> and time spent developing v out of the box are interesting too... 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-v-Redshift-tp18112.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to ship cython library to workers?

2014-11-04 Thread freedafeng
Thanks for the solution! I did figure out how to create an .egg file to ship
out to the workers. Using ipython seems to be another cool solution.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-ship-cython-library-to-workers-tp14467p18116.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: stackoverflow error

2014-11-04 Thread Hongbin Liu
Sorry, I have to change rank/lambda/iteration to the following

val ranks = List(100, 200, 400)
val lambdas = List(1.0, 2.0, 4.0)
val numIters = List(50, 100, 150)


From: Hongbin Liu
Sent: Tuesday, November 04, 2014 6:14 PM
To: 'user@spark.apache.org'
Cc: Gregory Campbell
Subject: stackoverflow error

Hi, can you help with the following? We are new to spark.

Error stack:

14/11/04 18:08:03 INFO SparkContext: Job finished: count at ALS.scala:314, took 
480.318100288 s
Exception in thread "main" java.lang.StackOverflowError
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at scala.collection.immutable.List.foreach(List.scala:318)


The code that generates the error I think is

val ranks = List(100, 120)
val lambdas = List(1.0, 10.0)
val numIters = List(10, 20)
var bestModel: Option[MatrixFactorizationModel] = None
var bestValidationRmse = Double.MaxValue
var bestRank = 0
var bestLambda = -1.0
var bestNumIter = -1
for (rank <- ranks; lambda <- lambdas; numIter <- numIters) {
  val model = ALS.train(training, rank, numIter, lambda)
  val validationRmse = computeRmse(model, validation, numValidation)
  appendToFile("rank-lamda-it-rmse.txt", rank + "," + lambda + "," +
    numIter + ", " + validationRmse)
  if (validationRmse < bestValidationRmse) {
    bestModel = Some(model)
    bestValidationRmse = validationRmse
    bestRank = rank
    bestLambda = lambda
    bestNumIter = numIter
  }
}




stackoverflow error

2014-11-04 Thread Hongbin Liu
Hi, can you help with the following? We are new to spark.

Error stack:

14/11/04 18:08:03 INFO SparkContext: Job finished: count at ALS.scala:314, took 
480.318100288 s
Exception in thread "main" java.lang.StackOverflowError
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1270)
at scala.collection.immutable.List.foreach(List.scala:318)


The code that generates the error I think is

val ranks = List(100, 120)
val lambdas = List(1.0, 10.0)
val numIters = List(10, 20)
var bestModel: Option[MatrixFactorizationModel] = None
var bestValidationRmse = Double.MaxValue
var bestRank = 0
var bestLambda = -1.0
var bestNumIter = -1
for (rank <- ranks; lambda <- lambdas; numIter <- numIters) {
  val model = ALS.train(training, rank, numIter, lambda)
  val validationRmse = computeRmse(model, validation, numValidation)
  appendToFile("rank-lamda-it-rmse.txt", rank + "," + lambda + "," +
    numIter + ", " + validationRmse)
  if (validationRmse < bestValidationRmse) {
    bestModel = Some(model)
    bestValidationRmse = validationRmse
    bestRank = rank
    bestLambda = lambda
    bestNumIter = numIter
  }
}




Re: Workers not registering after master restart

2014-11-04 Thread Nan Zhu
Hi, Ashic, 

this is expected for the latest released version 

However, workers should be able to re-register since 1.2, since this patch 
https://github.com/apache/spark/pull/2828 was merged

Best, 

-- 
Nan Zhu


On Tuesday, November 4, 2014 at 6:00 PM, Ashic Mahtab wrote:

> Hi,
> I've set up a standalone Spark master (no failover or file recovery 
> specified), and brought up a few worker nodes. All of them registered and 
> were shown in the master web UI. I then stopped and started the master 
> service (the workers were still running). After the master started up, I 
> checked the web UI and none of the workers were registered. I then stopped 
> and started each worker and they registered with the master again.
> 
> My question is - is this expected? Is there a timeout after which the worker 
> would have rejoined the master? Or is the only way to ensure workers rejoin 
> is to run master failover  or file based recovery for the master?
> 
> Thanks,
> Ashic.
> 
> 
> 




Spark v Redshift

2014-11-04 Thread agfung
I'm in the midst of a heated debate about the use of Redshift v Spark with a
colleague.  We keep trading anecdotes and links back and forth (eg airbnb
post from 2013 or amplab benchmarks), and we don't seem to be getting
anywhere. 

So before we start down the prototype /benchmark road, and in desperation 
of finding *some* kind of objective third party perspective,  was wondering
if anyone who has used both in 2014 would care to provide commentary about
the sweet spot use cases / gotchas for non trivial use (eg a simple filter
scan isn't really interesting).  Soft issues like operational maintenance
and time spent developing v out of the box are interesting too... 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-v-Redshift-tp18112.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Workers not registering after master restart

2014-11-04 Thread Ashic Mahtab
Hi,
I've set up a standalone Spark master (no failover or file recovery specified),
and brought up a few worker nodes. All of them registered and were shown in the
master web UI. I then stopped and started the master service (the workers were
still running). After the master started up, I checked the web UI and none of
the workers were registered. I then stopped and started each worker and they
registered with the master again.

My question is - is this expected? Is there a timeout after which the worker
would have rejoined the master? Or is the only way to ensure workers rejoin is
to run master failover or file based recovery for the master?

Thanks,
Ashic.

Re: spark sql create nested schema

2014-11-04 Thread Yin Huai
Hello Tridib,

For you case, you can use StructType(StructField("ParentInfo", parentInfo,
true) :: StructField("ChildInfo", childInfo, true) :: Nil) to create the
StructType representing the schema (parentInfo and childInfo are two
existing StructTypes). You can take a look at our docs (
http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#programmatically-specifying-the-schema
 and
http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.package
).
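
A minimal sketch of that construction for the schema in your message (all leaf
fields as nullable strings, matching the printed tree):

import org.apache.spark.sql._

val parentInfo = StructType(
  StructField("ID", StringType, true) ::
  StructField("State", StringType, true) ::
  StructField("Zip", StringType, true) :: Nil)

val childInfo = StructType(
  StructField("ID", StringType, true) ::
  StructField("State", StringType, true) ::
  StructField("Hobby", StringType, true) ::
  StructField("Zip", StringType, true) :: Nil)

val schema = StructType(
  StructField("ParentInfo", parentInfo, true) ::
  StructField("ChildInfo", childInfo, true) :: Nil)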

Thanks,

Yin

On Tue, Nov 4, 2014 at 3:19 PM, tridib  wrote:

> I am trying to create a schema which will look like:
>
> root
>  |-- ParentInfo: struct (nullable = true)
>  ||-- ID: string (nullable = true)
>  ||-- State: string (nullable = true)
>  ||-- Zip: string (nullable = true)
>  |-- ChildInfo: struct (nullable = true)
>  ||-- ID: string (nullable = true)
>  ||-- State: string (nullable = true)
>  ||-- Hobby: string (nullable = true)
>  ||-- Zip: string (nullable = true)
>
> How do I create a StructField of StructType? I think that's what the "root"
> is.
>
> Thanks & Regards
> Tridib
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-create-nested-schema-tp18090.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Streaming window operations not producing output

2014-11-04 Thread Tathagata Das
Didn't you get any errors in the log4j logs saying that you have to enable
checkpointing?

TD

On Tue, Nov 4, 2014 at 7:20 AM, diogo  wrote:

> So, to answer my own n00b question, in case anyone ever needs it. You have
> to enable checkpointing (by ssc.checkpoint(hdfsPath)). Windowed
> operations need to be *checkpointed*, otherwise windows just won't work
> (and how could they).
>
> On Tue, Oct 28, 2014 at 10:24 AM, diogo  wrote:
>
>> Hi there, I'm trying to use Window operations on streaming, but
>> every time I perform a windowed computation, I stop getting results.
>>
>> For example:
>>
>> val wordCounts = pairs.reduceByKey(_ + _)
>> wordCounts.print()
>>
>> Will print the output to the stdout on 'batch duration' interval. Now if
>> I replace it with:
>>
>> val wordCounts = pairs.reduceByKeyAndWindow(_+_, _-_, Seconds(4),
>> Seconds(2))
>> wordCounts.print()
>>
>> It will never output. What did I get wrong?
>>
>> Thanks.
>>
>
>
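
For reference, a minimal sketch pulling together the fix described above
(checkpointing must be enabled before using the inverse-function form of
reduceByKeyAndWindow); the checkpoint directory, source, and batch/window
durations are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val conf = new SparkConf().setAppName("WindowedWordCount")
val ssc = new StreamingContext(conf, Seconds(2))
ssc.checkpoint("hdfs://namenode/checkpoints/windowed")   // required for windowed ops with an inverse function

val pairs = ssc.socketTextStream("localhost", 9999)      // placeholder source
  .flatMap(_.split(" "))
  .map(w => (w, 1))

val wordCounts = pairs.reduceByKeyAndWindow(_ + _, _ - _, Seconds(4), Seconds(2))
wordCounts.print()

ssc.start()
ssc.awaitTermination()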


Re: How to make sure a ClassPath is always shipped to workers?

2014-11-04 Thread Peng Cheng
Thanks a lot! Unfortunately this is not my problem: the page class is already
in the jar that is shipped to every worker. (I've logged into the workers and
unpacked the jar files, and the class file is right there as intended.)
Also, this error only happens sporadically, not every time. The failed task was
sometimes automatically retried on a different worker and succeeded (but it
won't succeed if retried manually on the same partition), which makes it hard
to catch.

Yours Peng



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-sure-a-ClassPath-is-always-shipped-to-workers-tp18018p18107.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Model characterization

2014-11-04 Thread vinay453
Got it from a friend: println(model.weights) and println(model.intercept).



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Model-characterization-tp17985p18106.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: scala RDD sortby compilation error

2014-11-04 Thread Sean Owen
That works for me in the shell, at least, without the streaming bit
and whatever other code you had before this. Did you import
scala.reflect.classTag? I think you'd get a different error if not.
Maybe remove the "foreachFunc ="?
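
For reference, a minimal sortBy call that compiles (data made up); the first
argument is the key-extraction function the error message says is missing:

val rdd = sc.parallelize(Seq("pear", "apple", "banana"))

val alphabetical = rdd.sortBy(s => s)                       // natural ordering
val byLength     = rdd.sortBy(_.length, ascending = false)  // custom key, descending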

On Tue, Nov 4, 2014 at 9:11 PM, Josh J  wrote:
> Please find my code here.
>
> On Tue, Nov 4, 2014 at 11:33 AM, Josh J  wrote:
>>
>> I'm using the same code, though still receive
>>
>>  not enough arguments for method sortBy: (f: String => K, ascending:
>> Boolean, numPartitions: Int)(implicit ord: Ordering[K], implicit ctag:
>> scala.reflect.ClassTag[K])org.apache.spark.rdd.RDD[String].
>>
>> Unspecified value parameter f.
>>
>>
>> On Tue, Nov 4, 2014 at 11:28 AM, Josh J  wrote:
>>>
>>> Hi,
>>>
>>> Does anyone have any good examples of using sortby for RDDs and scala?
>>>
>>> I'm receiving
>>>
>>>  not enough arguments for method sortBy: (f: String => K, ascending:
>>> Boolean, numPartitions: Int)(implicit ord: Ordering[K], implicit ctag:
>>> scala.reflect.ClassTag[K])org.apache.spark.rdd.RDD[String].
>>>
>>> Unspecified value parameter f.
>>>
>>>
>>> I tried to follow the example in the test case by using the same approach
>>> even same method names and parameters though no luck.
>>>
>>>
>>> Thanks,
>>>
>>> Josh
>>
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: ERROR UserGroupInformation: PriviledgedActionException

2014-11-04 Thread Saiph Kappa
I set the host and port of the driver and now the error slightly changed
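
(For reference, a sketch of how the two properties suggested below can be set
on the SparkConf; the host name and port here are placeholders for a machine
and port reachable from the cluster.)

import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .setMaster("spark://myserver:7077")
  .setAppName("MyApp")
  .setJars(Array("target/my-app-1.0-SNAPSHOT.jar"))
  .set("spark.driver.host", "my-client-machine")   // must be reachable from the cluster
  .set("spark.driver.port", "51000")               // must be open on the client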

Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 14/11/04 21:13:48 INFO CoarseGrainedExecutorBackend: Registered signal
> handlers for [TERM, HUP, INT]
> 14/11/04 21:13:48 INFO SecurityManager: Changing view acls to:
> myuser,Myuser
> 14/11/04 21:13:48 INFO SecurityManager: Changing modify acls to:
> myuser,Myuser
> 14/11/04 21:13:48 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(myuser,
> Myuser); users with modify permissions: Set(myuser, Myuser)
> 14/11/04 21:13:48 INFO Slf4jLogger: Slf4jLogger started
> 14/11/04 21:13:48 INFO Remoting: Starting remoting
> 14/11/04 21:13:49 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://driverPropsFetcher@myserver:37456]
> 14/11/04 21:13:49 INFO Remoting: Remoting now listens on addresses:
> [akka.tcp://driverPropsFetcher@myserver:37456]
> 14/11/04 21:13:49 INFO Utils: Successfully started service
> 'driverPropsFetcher' on port 37456.
> 14/11/04 21:14:19 ERROR UserGroupInformation: PriviledgedActionException
> as:Myuser cause:java.util.concurrent.TimeoutException: Futures timed out
> after [30 seconds]
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException:
> Unknown exception in doAs
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
> at
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: java.security.PrivilegedActionException:
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> ... 4 more
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
> [30 seconds]
> at
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:125)
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
> ... 7 more
>

Any ideas?

Thanks.

On Tue, Nov 4, 2014 at 11:29 AM, Akhil Das 
wrote:

> If you want to run the spark application from a remote machine, then you
> have to at least set the following configurations properly.
>
> *spark.driver.host* - points to the ip/host from where you are submitting
> the job (make sure you are able to ping this from the cluster)
>
> *spark.driver.port* - set it to a port number which is accessible from
> the spark cluster.
>
> You can look at more configuration options over here.
> 
>
> Thanks
> Best Regards
>
> On Tue, Nov 4, 2014 at 6:07 AM, Saiph Kappa  wrote:
>
>> Hi,
>>
>> I am trying to submit a job to a spark cluster running on a single
>> machine (1 master + 1 worker) with hadoop 1.0.4. I submit it in the code:
>> «val sparkConf = new
>> SparkConf().setMaster("spark://myserver:7077").setAppName("MyApp").setJars(Array("target/my-app-1.0-SNAPSHOT.jar"))».
>>
>> When I run this application on the same machine as the cluster everything
>> works fine.
>>
>> But when I run it from a remote machine I get the following error:
>>
>> Using Spark's default log4j profile:
>>> org/apache/spark/log4j-defaults.properties
>>> 14/11/04 00:15:38 INFO CoarseGrainedExecutorBackend: Registered signal
>>> handlers for [TERM, HUP, INT]
>>> 14/11/04 00:15:38 INFO SecurityManager: Changing view acls to:
>>> myuser,Myuser
>>> 14/11/04 00:15:38 INFO SecurityManager: Changing modify acls to:
>>> myuser,Myuser
>>> 14/11/04 00:15:38 INFO SecurityManager: SecurityManager: authentication
>>> disabled; ui acls disabled; users with view permissions: Set(myuser,
>>> Myuser); users with modify permissions: Set(myuser, Myuser)
>>> 14/11/04 00:15:38 INFO Slf4jLogger: Slf4jLogger started
>>> 14/11/04 00:15:38 INFO Remoting: Starting remoting
>>> 14/11/04 00:15:38 INFO Remoting: Remoting started; listening on
>>> addresses :[akka.tcp://driverP

Re: save as JSON objects

2014-11-04 Thread Yin Huai
Hello Andrejs,

For now, you need to use a JSON lib to serialize records of your datasets
as JSON strings. In future, we will add a method to SchemaRDD to let you
write a SchemaRDD in JSON format (I have created
https://issues.apache.org/jira/browse/SPARK-4228 to track it).
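
A minimal sketch of that workaround (record type, data, and output path are made
up; a real JSON library such as json4s/Jackson is preferable to hand-built
strings): serialize each record yourself and write one JSON object per line.

case class Person(name: String, age: Int)

val people = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 35)))

// One JSON object per line.
val asJson = people.map(p => s"""{"name":"${p.name}","age":${p.age}}""")
asJson.saveAsTextFile("hdfs://namenode/out/people-json")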

Thanks,

Yin

On Tue, Nov 4, 2014 at 3:48 AM, Andrejs Abele <
andrejs.ab...@insight-centre.org> wrote:

> Hi,
> Can some one pleas sugest me, what is the best way to output spark data as
> JSON file. (File where each line is a JSON object)
> Cheers,
> Andrejs
>


Re: IllegalStateException: unread block data

2014-11-04 Thread freedafeng
Problem is solved. I basically built a fat Spark jar that includes all the
HBase stuff, and sent the examples.jar over to the slaves too.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/IllegalStateException-unread-block-data-tp18011p18102.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: scala RDD sortby compilation error

2014-11-04 Thread Josh J
Please find my code here
.

On Tue, Nov 4, 2014 at 11:33 AM, Josh J  wrote:

> I'm using the same code
> ,
> though still receive
>
>  not enough arguments for method sortBy: (f: String => K, ascending:
> Boolean, numPartitions: Int)(implicit ord: Ordering[K], implicit ctag:
> scala.reflect.ClassTag[K])org.apache.spark.rdd.RDD[String].
>
> Unspecified value parameter f.
>
> On Tue, Nov 4, 2014 at 11:28 AM, Josh J  wrote:
>
>> Hi,
>>
>> Does anyone have any good examples of using sortby for RDDs and scala?
>>
>> I'm receiving
>>
>>  not enough arguments for method sortBy: (f: String => K, ascending:
>> Boolean, numPartitions: Int)(implicit ord: Ordering[K], implicit ctag:
>> scala.reflect.ClassTag[K])org.apache.spark.rdd.RDD[String].
>>
>> Unspecified value parameter f.
>>
>>
>> I tried to follow the example in the test case
>> 
>>  by
>> using the same approach even same method names and parameters though no
>> luck.
>>
>>
>> Thanks,
>>
>> Josh
>>
>
>


Re: Streaming: which code is (not) executed at every batch interval?

2014-11-04 Thread Steve Reinhardt
From: Sean Owen 

>Maybe you are looking for updateStateByKey?
>http://spark.apache.org/docs/latest/streaming-programming-guide.html#trans
>formations-on-dstreams
>
>You can use broadcast to efficiently send info to all the workers, if
>you have some other data that's immutable, like in a local file, that
>needs to be distributed.

The second flow of data is coming from a human, so updating rarely by
streaming standards.  I'm agnostic about how to incorporate this second
flow, it just needs to work reasonably somehow.

My first approach was to put it in a file and monitor changes to that file
(in a Linux, non-Spark way) and then disseminate it to all the nodes
somehow.  
- Since the data is mutable, at first blush broadcast() seems a poor
match.  Or is there some way for the result of the broadcast to be a new
variable each time, in the streaming code?
- Is there a way for the (Linux, non-Spark) code to read the file and then
write it into a socket (say) stream that is (redundantly, not
partitionedly (word?)) written to all the nodes?  (I.e., it is a broadcast
in that sense.)

I hope my description makes sense.


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: MEMORY_ONLY_SER question

2014-11-04 Thread Tathagata Das
It is deserialized in a streaming manner as the iterator moves over the
partition. This is a functionality of core Spark, and Spark Streaming just
uses it as is.
What do you want to customize it to?
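
For reference, a minimal sketch of the pattern being discussed (the path is a
placeholder, and Kryo would be configured separately on the SparkConf): records
come out of the serialized partition one at a time as the iterator advances.

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs://namenode/data/input")
  .persist(StorageLevel.MEMORY_ONLY_SER)

// The map/reduce below pulls records through the partition iterator,
// deserializing them lazily rather than materializing the whole partition.
val totalChars = rdd.map(_.length.toLong).reduce(_ + _)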

On Tue, Nov 4, 2014 at 9:22 AM, Mohit Jaggi  wrote:

> Folks,
> If I have an RDD persisted in MEMORY_ONLY_SER mode and then it is needed
> for a transformation/action later, is the whole partition of the RDD
> deserialized into Java objects first before my transform/action code works
> on it? Or is it deserialized in a streaming manner as the iterator moves
> over the partition? Is this behavior customizable? I generally use the Kryo
> serializer.
>
> Mohit.
>


Best practice for join

2014-11-04 Thread Benyi Wang
I need to join RDD[A], RDD[B], and RDD[C]. Here is what I did,

# build (K,V) from A and B to prepare the join

val ja = A.map( r => (K1, Va))
val jb = B.map( r => (K1, Vb))

# join A, B

val jab = ja.join(jb)

# build (K,V) from the joined result of A and B to prepare joining with C

val jc = C.map(r => (K2, Vc))
jab.join(jc).map( r => (K, V) ).reduceByKey(_ + _)

Because A may have multiple fields, so Va is a tuple with more than 2
fields. It is said that scala Tuple may not be specialized, and there is
boxing/unboxing issue, so I tried to use "case class" for Va, Vb, and Vc,
K2 and K which are compound keys, and V is a pair of count and ratio, _+_
will create a new ratio. I register those case classes in Kryo.

The sizes of Shuffle read/write look smaller. But I found GC overhead is
really high: GC Time is about 20~30% of duration for the reduceByKey task.
I think a lot of new objects are created using case classes during
map/reduce.

How to make the thing better?
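
For reference, a sketch of the Kryo registration described above (the case-class
fields are made up, and the registrator should be referenced by its
fully-qualified name in a real application):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

case class Va(count: Long, ratio: Double, label: String)   // stand-ins for the real value classes
case class Vb(count: Long, ratio: Double)

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Va])
    kryo.register(classOf[Vb])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")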


Re: StructField of StructType

2014-11-04 Thread Michael Armbrust
Structs are Rows nested in other rows.  This might also be helpful:
http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
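
For example, a record matching the ParentInfo/ChildInfo schema from this thread
would be built as nested Rows (values are made up):

import org.apache.spark.sql._

val record = Row(
  Row("p-100", "CA", "94089"),            // ParentInfo struct
  Row("c-200", "CA", "chess", "94089"))   // ChildInfo struct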

On Tue, Nov 4, 2014 at 12:21 PM, tridib  wrote:

> How do I create a StructField of StructType? I need to create a nested
> schema.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/StructField-of-StructType-tp18091.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


StructField of StructType

2014-11-04 Thread tridib
How do I create a StructField of StructType? I need to create a nested
schema.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/StructField-of-StructType-tp18091.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark sql create nested schema

2014-11-04 Thread tridib
I am trying to create a schema which will look like:

root
 |-- ParentInfo: struct (nullable = true)
 ||-- ID: string (nullable = true)
 ||-- State: string (nullable = true)
 ||-- Zip: string (nullable = true)
 |-- ChildInfo: struct (nullable = true)
 ||-- ID: string (nullable = true)
 ||-- State: string (nullable = true)
 ||-- Hobby: string (nullable = true)
 ||-- Zip: string (nullable = true)

How do I create a StructField of StructType? I think that's what the "root"
is.

Thanks & Regards 
Tridib



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-create-nested-schema-tp18090.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[ANN] Spark resources searchable

2014-11-04 Thread Otis Gospodnetic
Hi everyone,

We've recently added indexing of all Spark resources to
http://search-hadoop.com/spark .

Everything is nicely searchable:
* user & dev mailing lists
* JIRA issues
* web site
* wiki
* source code
* javadoc.

Maybe it's worth adding to http://spark.apache.org/community.html ?

Enjoy!

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


Re: Streaming: which code is (not) executed at every batch interval?

2014-11-04 Thread Sean Owen
On Tue, Nov 4, 2014 at 8:02 PM, spr  wrote:
> To state this another way, it seems like there's no way to straddle the
> streaming world and the non-streaming world;  to get input from both a
> (vanilla, Linux) file and a stream.  Is that true?
>
> If so, it seems I need to turn my (vanilla file) data into a second stream.

Hm, why do you say that? nothing prevents that at all. You can do
anything you like in your local code, or in functions you send to
remote workers. (Of course, if those functions depend on a local file,
it has to exist locally on the workers.) You do have to think about
the distributed model here, but what executes locally/remotely isn't
mysterious. It is things in calls to Spark API method that will be
executed remotely.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Streaming: which code is (not) executed at every batch interval?

2014-11-04 Thread Steve Reinhardt

-Original Message-
From: Sean Owen 

>On Tue, Nov 4, 2014 at 8:02 PM, spr  wrote:
>> To state this another way, it seems like there's no way to straddle the
>> streaming world and the non-streaming world;  to get input from both a
>> (vanilla, Linux) file and a stream.  Is that true?
>>
>> If so, it seems I need to turn my (vanilla file) data into a second
>>stream.
>
>Hm, why do you say that? nothing prevents that at all. You can do
>anything you like in your local code, or in functions you send to
>remote workers. (Of course, if those functions depend on a local file,
>it has to exist locally on the workers.) You do have to think about
>the distributed model here, but what executes locally/remotely isn't
>mysterious. It is things in calls to Spark API method that will be
>executed remotely.

The distinction I was calling out was temporal, not local/distributed,
though that is another important dimension.  It sounds like I can do
anything I want in the code before the ssc.start(), but that code runs
once at the beginning of the program.  What I'm searching for is some way
to have code that runs repeatedly and potentially updates a variable that
the Streaming code will see.  Broadcast() almost does that, but apparently
the underlying variable should be immutable.  I'm not aware of any (Spark)
way to have code run repeatedly other than as part of the Spark Streaming
API, but that doesn't look at vanilla files.

The distributed angle you raise makes my "vanilla file" approach not quite
credible, in that the vanilla file would have to be distributed to all the
nodes for the updates to be seen.  So maybe the simplest way to do that is
to have a vanilla Linux code monitoring the vanilla file (on a client
node) and sending any changes to it into a (distinct) stream.  If so, the
remote code would need to monitor both that stream and the main data
stream.  Does that make sense?


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Streaming: which code is (not) executed at every batch interval?

2014-11-04 Thread Sean Owen
Maybe you are looking for updateStateByKey?
http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams

You can use broadcast to efficiently send info to all the workers, if
you have some other data that's immutable, like in a local file, that
needs to be distributed.
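
For reference, a minimal updateStateByKey sketch (source, checkpoint path, and
batch interval are placeholders) that keeps a running count per key across
batches:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val conf = new SparkConf().setAppName("RunningCounts")
val ssc = new StreamingContext(conf, Seconds(2))
ssc.checkpoint("hdfs://namenode/checkpoints/state")   // required for stateful operations

val pairs = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(w => (w, 1))

// Merge each batch's values into the running state for the key.
val updateCounts = (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))

val runningCounts = pairs.updateStateByKey[Int](updateCounts)
runningCounts.print()

ssc.start()
ssc.awaitTermination()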

On Tue, Nov 4, 2014 at 8:38 PM, Steve Reinhardt  wrote:
>
> -Original Message-
> From: Sean Owen 
>
>>On Tue, Nov 4, 2014 at 8:02 PM, spr  wrote:
>>> To state this another way, it seems like there's no way to straddle the
>>> streaming world and the non-streaming world;  to get input from both a
>>> (vanilla, Linux) file and a stream.  Is that true?
>>>
>>> If so, it seems I need to turn my (vanilla file) data into a second
>>>stream.
>>
>>Hm, why do you say that? nothing prevents that at all. You can do
>>anything you like in your local code, or in functions you send to
>>remote workers. (Of course, if those functions depend on a local file,
>>it has to exist locally on the workers.) You do have to think about
>>the distributed model here, but what executes locally/remotely isn't
>>mysterious. It is things in calls to Spark API method that will be
>>executed remotely.
>
> The distinction I was calling out was temporal, not local/distributed,
> though that is another important dimension.  It sounds like I can do
> anything I want in the code before the ssc.start(), but that code runs
> once at the beginning of the program.  What I'm searching for is some way
> to have code that runs repeatedly and potentially updates a variable that
> the Streaming code will see.  Broadcast() almost does that, but apparently
> the underlying variable should be immutable.  I'm not aware of any (Spark)
> way to have code run repeatedly other than as part of the Spark Streaming
> API, but that doesn't look at vanilla files.
>
> The distributed angle you raise makes my "vanilla file" approach not quite
> credible, in that the vanilla file would have to be distributed to all the
> nodes for the updates to be seen.  So maybe the simplest way to do that is
> to have a vanilla Linux code monitoring the vanilla file (on a client
> node) and sending any changes to it into a (distinct) stream.  If so, the
> remote code would need to monitor both that stream and the main data
> stream.  Does that make sense?
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Streaming: which code is (not) executed at every batch interval?

2014-11-04 Thread spr
Good, thanks for the clarification.  It would be great if this were precisely
stated somewhere in the docs.  :)

To state this another way, it seems like there's no way to straddle the
streaming world and the non-streaming world;  to get input from both a
(vanilla, Linux) file and a stream.  Is that true?  

If so, it seems I need to turn my (vanilla file) data into a second stream.



sowen wrote
> Yes, code is just local Scala code unless it's invoking Spark APIs.
> The "non-Spark-streaming" block appears to just be normal program code
> executed in your driver, which ultimately starts the streaming
> machinery later. It executes once; there is nothing about that code
> connected to Spark. It's not magic.
> 
> To execute code against every RDD you use operations like foreachRDD
> on DStream to write a function that is executed at each batch interval
> on an RDD.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-which-code-is-not-executed-at-every-batch-interval-tp18071p18087.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL - No support for subqueries in 1.2-snapshot?

2014-11-04 Thread Terry Siu
Done.

https://issues.apache.org/jira/browse/SPARK-4226

Hoping this will make it into 1.3?  :)

-Terry

From: Michael Armbrust <mich...@databricks.com>
Date: Tuesday, November 4, 2014 at 11:31 AM
To: Terry Siu <terry@smartfocus.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: SparkSQL - No support for subqueries in 1.2-snapshot?

This is not supported yet.  It would be great if you could open a JIRA (though 
I think apache JIRA is down ATM).

On Tue, Nov 4, 2014 at 9:40 AM, Terry Siu <terry@smartfocus.com> wrote:
I’m trying to execute a subquery inside an IN clause and am encountering an 
unsupported language feature in the parser.


java.lang.RuntimeException: Unsupported language features in query: select 
customerid from sparkbug where customerid in (select customerid from sparkbug 
where customerid in (2,3))

TOK_QUERY

  TOK_FROM

TOK_TABREF

  TOK_TABNAME

sparkbug

  TOK_INSERT

TOK_DESTINATION

  TOK_DIR

TOK_TMP_FILE

TOK_SELECT

  TOK_SELEXPR

TOK_TABLE_OR_COL

  customerid

TOK_WHERE

  TOK_SUBQUERY_EXPR

TOK_SUBQUERY_OP

  in

TOK_QUERY

  TOK_FROM

TOK_TABREF

  TOK_TABNAME

sparkbug

  TOK_INSERT

TOK_DESTINATION

  TOK_DIR

TOK_TMP_FILE

TOK_SELECT

  TOK_SELEXPR

TOK_TABLE_OR_COL

  customerid

TOK_WHERE

  TOK_FUNCTION

in

TOK_TABLE_OR_COL

  customerid

2

3

TOK_TABLE_OR_COL

  customerid


scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
TOK_SUBQUERY_EXPR :

TOK_SUBQUERY_EXPR

  TOK_SUBQUERY_OP

in

  TOK_QUERY

TOK_FROM

  TOK_TABREF

TOK_TABNAME

  sparkbug

TOK_INSERT

  TOK_DESTINATION

TOK_DIR

  TOK_TMP_FILE

  TOK_SELECT

TOK_SELEXPR

  TOK_TABLE_OR_COL

customerid

  TOK_WHERE

TOK_FUNCTION

  in

  TOK_TABLE_OR_COL

customerid

  2

  3

  TOK_TABLE_OR_COL

customerid

" +



org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)



at scala.sys.package$.error(package.scala:27)

at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)

at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)

at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)

at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)

Are subqueries in predicates just not supported in 1.2? I think I’m seeing the 
same issue as:

http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html

Thanks,
-Terry






Re: avro + parquet + vector + NullPointerException while reading

2014-11-04 Thread Michael Armbrust
You might consider using the native parquet support built into Spark SQL
instead of using the raw library:

http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

On Mon, Nov 3, 2014 at 7:33 PM, Michael Albert <
m_albert...@yahoo.com.invalid> wrote:

> Greetings!
>
>
> I'm trying to use avro and parquet with the following schema:
>
> {
>
> "name": "TestStruct",
>
> "namespace": "bughunt",
>
> "type": "record",
>
> "fields": [
>
> {
>
> "name": "string_array",
>
> "type": { "type": "array", "items": "string" }
>
> }
>
> ]
>
> }
> The writing process seems to be OK, but when I try to read it with Spark,
> I get:
>
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
>
> Serialization trace:
>
> string_array (bughunt.TestStruct)
>
> at
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
>
> at
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
>
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> When I try to read it with Hive, I get this:
>
> Failed with exception
> java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
> java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be
> cast to org.apache.hadoop.io.ArrayWritable
> Which would lead me to suspect that this might be related to this one:
> https://github.com/Parquet/parquet-mr/issues/281 , but that one seems to
> be Hive specific, and I am not seeing Spark read the data it claims to have
> written itself.
>
> I'm running on an Amazon EMR cluster using the "version 2.4.0" hadoop code
> and spark 1.1.0.
> Has anyone else observed this sort of behavior?
>
> For completeness, here is the code that writes the data:
>
> package bughunt
>
>
> import org.apache.hadoop.mapreduce.Job
>
>
> import org.apache.spark.SparkConf
>
> import org.apache.spark.SparkContext
>
> import org.apache.spark.SparkContext._
>
>
>
> import parquet.avro.AvroWriteSupport
>
> import parquet.avro.AvroParquetOutputFormat
>
> import parquet.hadoop.ParquetOutputFormat
>
>
> import java.util.ArrayList
>
>
>
> object GenData {
>
> val outputPath = "/user/x/testdata"
>
> val words = List(
>
> List("apple", "banana", "cherry"),
>
> List("car", "boat", "plane"),
>
> List("lion", "tiger", "bear"),
>
> List("north", "south", "east", "west"),
>
> List("up", "down", "left", "right"),
>
> List("red", "green", "blue"))
>
>
> def main(args: Array[String]) {
>
> val conf = new SparkConf(true)
>
> .setAppName("IngestLoanApplicattion")
>
> //.set("spark.kryo.registrator",
>
> //classOf[CommonRegistrator].getName)
>
> .set("spark.serializer",
>
> "org.apache.spark.serializer.KryoSerializer")
>
> .set("spark.kryoserializer.buffer.mb", 4.toString)
>
> .set("spark.kryo.referenceTracking", "false")
>
>
> val sc = new SparkContext(conf)
>
>
> val rdd = sc.parallelize(words)
>
>
> val job = new Job(sc.hadoopConfiguration)
>
>
> ParquetOutputFormat.setWriteSupportClass(job,
> classOf[AvroWriteSupport])
>
> AvroParquetOutputFormat.setSchema(job,
>
> TestStruct.SCHEMA$)
>
>
> rdd.map(p => {
>
> val xs = new java.util.ArrayList[String]
>
> for (z<-p) { xs.add(z) }
>
> val bldr = TestStruct.newBuilder()
>
> bldr.setStringArray(xs)
>
> (null, bldr.build()) })
>
>.saveAsNewAPIHadoopFile(outputPath,
>
> classOf[Void],
>
> classOf[TestStruct],
>
> classOf[ParquetOutputFormat[TestStruct]],
>
> job.getConfiguration)
>
> }
>
> }
>
> To read the data, I use this sort of code from the spark-shell:
>
> :paste
>
>
> import bughunt.TestStruct
>
>
> import org.apache.hadoop.mapreduce.Job
>
> import org.apache.spark.SparkContext
>
>
> import parquet.hadoop.ParquetInputFormat
>
> import parquet.avro.AvroReadSupport
>
>
> def openRddSpecific(sc: SparkContext) = {
>
> val job = new Job(sc.hadoopConfiguration)
>
>
> ParquetInputFormat.setReadSupportClass(job,
>
> classOf[AvroReadSupport[TestStruct]])
>
>
> sc.newAPIHadoopFile("/user/malbert/testdata",
>
> classOf[ParquetInputFormat[TestStruct]],
>
> classOf[Void],
>
> classOf[TestStruct],
>
> job.getConfiguration)
>
> }
> I start the Spark shell as follows:
>
> spark-shell \
>
> --jars ../my-jar-containing-the-class-definitions.jar \
>
> --conf mapreduce.user.classpath.first=true \
>
> --conf spark.kryo.referenceTracking=f

Re: Streaming: which code is (not) executed at every batch interval?

2014-11-04 Thread Sean Owen
Yes, code is just local Scala code unless it's invoking Spark APIs.
The "non-Spark-streaming" block appears to just be normal program code
executed in your driver, which ultimately starts the streaming
machinery later. It executes once; there is nothing about that code
connected to Spark. It's not magic.

To execute code against every RDD you use operations like foreachRDD
on DStream to write a function that is executed at each batch interval
on an RDD.
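
For example, a minimal sketch along those lines, reusing the names from the
code quoted below (whiteArg is assumed to be a driver-local path, and the
filter on the first field is only illustrative):

// the body of foreachRDD runs on the driver once per batch interval,
// so the file can be re-read here and applied to that batch's RDD
linesArray.foreachRDD { rdd =>
  val whitelist = scala.io.Source.fromFile(whiteArg).getLines().toSet
  val matched = rdd.filter(fields => whitelist.contains(fields(0)))
  matched.take(10).foreach(println)
}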

On Tue, Nov 4, 2014 at 5:43 PM, spr  wrote:
> The use case I'm working on has a main data stream in which a human needs to
> modify what to look for.  I'm thinking to implement the main data stream
> with Spark Streaming and the things to look for with Spark.
> (Better approaches welcome.)
>
> To do this, I have intermixed Spark and Spark Streaming code, and it appears
> that the Spark code is not being executed every batch interval.  With
> details elided, it looks like
>
> val sc = new SparkContext(conf)
> val ssc = new StreamingContext(conf, Seconds(10))
> ssc.checkpoint(".")
>
> var lines = ssc.textFileStream(dirArg) //
> Spark Streaming code
> var linesArray = lines.map( line => (line.split("\t")))
>
> val whiteFd = (new java.io.File(whiteArg))  //
> non-Spark-Streaming code
> if (whiteFd.lastModified > System.currentTimeMillis-(timeSliceArg*1000))
> {
>   // read the file into a var
>
>
> //   Spark Streaming code
> var SvrCum = newState.updateStateByKey[(Int, Time, Time)](updateMyState)
>
> It appears the non-Spark-Streaming code gets executed once at program
> initiation but not repeatedly. So, two questions:
>
> 1)  Is it correct that Spark code does not get executed per batch interval?
>
> 2)  Is there a definition somewhere of what code will and will not get
> executed per batch interval?  (I didn't find it in either the Spark or Spark
> Streaming programming guides.)
>
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-which-code-is-not-executed-at-every-batch-interval-tp18071.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: loading, querying schemaRDD using SparkSQL

2014-11-04 Thread Michael Armbrust
Temporary tables are local to the context that creates them (just like
RDDs).  I'd recommend saving the data out as Parquet to share it between
contexts.
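
For example, something along these lines (paths and the table name are made
up, and schemaRDD stands in for whatever SchemaRDD you built from the JSON):

// in the application that loads and prepares the data
schemaRDD.saveAsParquetFile("hdfs:///shared/events.parquet")

// in the web application, with its own SQLContext
val events = sqlContext.parquetFile("hdfs:///shared/events.parquet")
events.registerTempTable("events")
sqlContext.sql("SELECT count(*) FROM events").collect()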

On Tue, Nov 4, 2014 at 3:18 AM, vdiwakar.malladi  wrote:

> Hi,
>
> There is a need in my application to query the loaded data into
> sparkcontext
> (I mean loaded SchemaRDD from JSON file(s)). For this purpose, I created
> the
> SchemaRDD and call registerTempTable method in a standalone program and
> submited the application using spark-submit command.
>
> Then I have a web application from which I used Spark SQL to query the
> same.
> But I'm getting 'Table Not Found' error.
>
> So, my question is, is it possible to load the data from one context and
> query from other? If not, can any one advice me on this.
>
> Thanks in advance.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/loading-querying-schemaRDD-using-SparkSQL-tp18052.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: scala RDD sortby compilation error

2014-11-04 Thread Josh J
I'm using the same code, though I still receive

 not enough arguments for method sortBy: (f: String => K, ascending:
Boolean, numPartitions: Int)(implicit ord: Ordering[K], implicit ctag:
scala.reflect.ClassTag[K])org.apache.spark.rdd.RDD[String].

Unspecified value parameter f.
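
For reference, the call shape the signature seems to expect would be something
like the following (toy data, not the code from the linked test case):

val rdd = sc.parallelize(Seq("banana", "apple", "cherry"))
val alphabetical = rdd.sortBy(s => s)                    // natural String ordering
val byLength = rdd.sortBy(_.length, ascending = false)   // explicit key function
alphabetical.collect()   // Array(apple, banana, cherry)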

On Tue, Nov 4, 2014 at 11:28 AM, Josh J  wrote:

> Hi,
>
> Does anyone have any good examples of using sortby for RDDs and scala?
>
> I'm receiving
>
>  not enough arguments for method sortBy: (f: String => K, ascending:
> Boolean, numPartitions: Int)(implicit ord: Ordering[K], implicit ctag:
> scala.reflect.ClassTag[K])org.apache.spark.rdd.RDD[String].
>
> Unspecified value parameter f.
>
>
> I tried to follow the example in the test case by
> using the same approach, even the same method names and parameters, though no
> luck.
>
>
> Thanks,
>
> Josh
>


Re: Spark SQL takes unexpected time

2014-11-04 Thread Michael Armbrust
People also store data off-heap by putting parquet data into Tachyon.

The optimization in 1.2 is to use the in-memory columnar cached format
instead of keeping row objects (and their boxed contents) around when you
call .cache().  This significantly reduces the number of live objects.
(since you have a single byte buffer per column batch).
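
A rough sketch of the two approaches mentioned above (illustrative only; the
table name is made up and the Tachyon URI assumes a running Tachyon master):

// 1) Spark SQL's in-memory columnar cache: one byte buffer per column batch
schemaRDD.registerTempTable("events")
sqlContext.cacheTable("events")
sqlContext.sql("SELECT count(*) FROM events").collect()

// 2) keeping the data off the JVM heap by writing Parquet files to Tachyon
schemaRDD.saveAsParquetFile("tachyon://tachyon-master:19998/warehouse/events.parquet")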

On Tue, Nov 4, 2014 at 5:19 AM, Corey Nolet  wrote:

> Michael,
>
> I should probably look closer myself @ the design of 1.2 vs 1.1 but I've
> been curious why Spark's in-memory data uses the heap instead of putting it
> off heap? Was this the optimization that was done in 1.2 to alleviate GC?
>
> On Mon, Nov 3, 2014 at 8:52 PM, Shailesh Birari 
> wrote:
>
>> Yes, I am using Spark1.1.0 and have used rdd.registerTempTable().
>> I tried by adding sqlContext.cacheTable(), but it took 59 seconds (more
>> than
>> earlier).
>>
>> I also tried by changing schema to use Long data type in some fields but
>> seems conversion takes more time.
>> Is there any way to specify index ?  Though I checked and didn't found
>> any,
>> just want to confirm.
>>
>> For your reference here is the snippet of code.
>>
>>
>> -
>> case class EventDataTbl(EventUID: Long,
>> ONum: Long,
>> RNum: Long,
>> Timestamp: java.sql.Timestamp,
>> Duration: String,
>> Type: String,
>> Source: String,
>> OName: String,
>> RName: String)
>>
>> val format = new java.text.SimpleDateFormat("-MM-dd
>> hh:mm:ss")
>> val cedFileName =
>> "hdfs://hadoophost:8020/demo/poc/JoinCsv/output_2"
>> val cedRdd = sc.textFile(cedFileName).map(_.split(",",
>> -1)).map(p =>
>> EventDataTbl(p(0).toLong, p(1).toLong, p(2).toLong, new
>> java.sql.Timestamp(format.parse(p(3)).getTime()), p(4), p(5), p(6), p(7),
>> p(8)))
>>
>> cedRdd.registerTempTable("EventDataTbl")
>> sqlCntxt.cacheTable("EventDataTbl")
>>
>> val t1 = System.nanoTime()
>> println("\n\n10 Most frequent conversations between the
>> Originators and
>> Recipients\n")
>> sql("SELECT COUNT(*) AS Frequency,ONum,OName,RNum,RName
>> FROM EventDataTbl
>> GROUP BY ONum,OName,RNum,RName ORDER BY Frequency DESC LIMIT
>> 10").collect().foreach(println)
>> val t2 = System.nanoTime()
>> println("Time taken " + (t2-t1)/10.0 + " Seconds")
>>
>>
>> -
>>
>> Thanks,
>>   Shailesh
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-takes-unexpected-time-tp17925p18017.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Fwd: Master example.MovielensALS

2014-11-04 Thread Debasish Das
Hi,

I just built the master today and I was testing the IR metrics (MAP and
prec@k) on Movielens data to establish a baseline...

I am getting a weird error which I have not seen before:

MASTER=spark://TUSCA09LMLVT00C.local:7077 ./bin/run-example
mllib.MovieLensALS --kryo --lambda 0.065
hdfs://localhost:8020/sandbox/movielens/

2014-11-04 11:01:17.350 java[56469:1903] Unable to load realm mapping info
from SCDynamicStore

14/11/04 11:01:17 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable

14/11/04 11:01:22 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
192.168.107.125): java.io.InvalidClassException:
org.apache.spark.examples.mllib.MovieLensALS$Params; no valid constructor

For Params there is no valid constructor... Is this due to the
AbstractParam change?

Thanks.

Deb


Re: SparkSQL - No support for subqueries in 1.2-snapshot?

2014-11-04 Thread Michael Armbrust
This is not supported yet.  It would be great if you could open a JIRA
(though I think apache JIRA is down ATM).

On Tue, Nov 4, 2014 at 9:40 AM, Terry Siu  wrote:

>  I’m trying to execute a subquery inside an IN clause and am encountering
> an unsupported language feature in the parser.
>
>  java.lang.RuntimeException: Unsupported language features in query:
> select customerid from sparkbug where customerid in (select customerid from
> sparkbug where customerid in (2,3))
>
> TOK_QUERY
>
>   TOK_FROM
>
> TOK_TABREF
>
>   TOK_TABNAME
>
> sparkbug
>
>   TOK_INSERT
>
> TOK_DESTINATION
>
>   TOK_DIR
>
> TOK_TMP_FILE
>
> TOK_SELECT
>
>   TOK_SELEXPR
>
> TOK_TABLE_OR_COL
>
>   customerid
>
> TOK_WHERE
>
>   TOK_SUBQUERY_EXPR
>
> TOK_SUBQUERY_OP
>
>   in
>
> TOK_QUERY
>
>   TOK_FROM
>
> TOK_TABREF
>
>   TOK_TABNAME
>
> sparkbug
>
>   TOK_INSERT
>
> TOK_DESTINATION
>
>   TOK_DIR
>
> TOK_TMP_FILE
>
> TOK_SELECT
>
>   TOK_SELEXPR
>
> TOK_TABLE_OR_COL
>
>   customerid
>
> TOK_WHERE
>
>   TOK_FUNCTION
>
> in
>
> TOK_TABLE_OR_COL
>
>   customerid
>
> 2
>
> 3
>
> TOK_TABLE_OR_COL
>
>   customerid
>
>
>  scala.NotImplementedError: No parse rules for ASTNode type: 817, text:
> TOK_SUBQUERY_EXPR :
>
> TOK_SUBQUERY_EXPR
>
>   TOK_SUBQUERY_OP
>
> in
>
>   TOK_QUERY
>
> TOK_FROM
>
>   TOK_TABREF
>
> TOK_TABNAME
>
>   sparkbug
>
> TOK_INSERT
>
>   TOK_DESTINATION
>
> TOK_DIR
>
>   TOK_TMP_FILE
>
>   TOK_SELECT
>
> TOK_SELEXPR
>
>   TOK_TABLE_OR_COL
>
> customerid
>
>   TOK_WHERE
>
> TOK_FUNCTION
>
>   in
>
>   TOK_TABLE_OR_COL
>
> customerid
>
>   2
>
>   3
>
>   TOK_TABLE_OR_COL
>
> customerid
>
> " +
>
>
>
> org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)
>
>
>
> at scala.sys.package$.error(package.scala:27)
>
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)
>
> at
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
>
> at
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
>
> at
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
>
>  Are subqueries in predicates just not supported in 1.2? I think I’m
> seeing the same issue as:
>
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html
>
>  Thanks,
> -Terry
>
>


scala RDD sortby compilation error

2014-11-04 Thread Josh J
Hi,

Does anyone have any good examples of using sortby for RDDs and scala?

I'm receiving

 not enough arguments for method sortBy: (f: String => K, ascending:
Boolean, numPartitions: Int)(implicit ord: Ordering[K], implicit ctag:
scala.reflect.ClassTag[K])org.apache.spark.rdd.RDD[String].

Unspecified value parameter f.


I tried to follow the example in the test case by
using the same approach, even the same method names and parameters, though no
luck.


Thanks,

Josh


What's wrong with my settings about shuffle/storage.memoryFraction

2014-11-04 Thread Benyi Wang
I don't need to cache RDDs in my spark Application, but there is a big
shuffle in the data processing. I can always find Shuffle spill (memory)
and Shuffle spill (disk). I'm wondering if I can give more memory to
shuffle to avoid spill to disk.

export SPARK_JAVA_OPTS='-Dspark.shuffle.memoryFraction=.7
-Dspark.storage.memoryFraction=.1 -Dspark.io.compression.codec=org.apach
e.spark.io.SnappyCompressionCodec -Dspark.shuffle.compress=true'

Using standalone with the following parameters in spark-submit
--num-executors 3 --executor-memory 15G --executor-cores 6 --driver-memory
2G

And it turns out that's a really bad idea. The worker keeps hanging at some point
in the reduceByKey step. Sometimes I had to go to each data node and kill the
worker manually.

If I remove SPARK_JAVA_OPTS, the application finishes successfully.

Will all RDDs be stored within spark.storage.memoryFraction? If so, why does the
"Executors" tab always show "0.0 B / 8.6 GB" for memory used? Also, the
"Storage" tab shows nothing in the Application Detail UI.

Any idea how to set those parameters to avoid spilling more data?

I'm using Spark-1.0.0 in CDH 5.1.0.
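
For reference, here is roughly how I would express the same settings through
SparkConf in the driver instead of SPARK_JAVA_OPTS (the fraction values are
only illustrative):

val conf = new SparkConf()
  .setAppName("MyJob")
  .set("spark.shuffle.memoryFraction", "0.4")
  .set("spark.storage.memoryFraction", "0.3")
  .set("spark.shuffle.compress", "true")
  .set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
val sc = new SparkContext(conf)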


Re: Spark Streaming appears not to recognize a more recent version of an already-seen file; true?

2014-11-04 Thread spr
Holden Karau wrote
> This is the expected behavior. Spark Streaming only reads new files once,
> this is why they must be created through an atomic move so that Spark
> doesn't accidentally read a partially written file. I'd recommend looking
> at "Basic Sources" in the Spark Streaming guide (
> http://spark.apache.org/docs/latest/streaming-programming-guide.html ).

Thanks for the quick response.

OK, this does seem consistent with the rest of Spark Streaming.  Looking at
Basic Sources, it says
"Once moved [into the directory being observed], the files must not be
changed."  I don't think of removing a file and creating a new one under the
same name as "changing" the file, i.e., it has a different inode number.  It
might be more precise to say something like "Once a filename has been
detected by Spark Streaming, it will be viewed as having been processed for
the life of the context."  This edge case also implies that any
filename-generating code has to be certain it will not create repeats within
the life of a context, which is not easily deduced from the existing
description.

Thanks again.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-appears-not-to-recognize-a-more-recent-version-of-an-already-seen-file-true-tp18074p18076.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Streaming appears not to recognize a more recent version of an already-seen file; true?

2014-11-04 Thread Holden Karau
This is the expected behavior. Spark Streaming only reads new files once,
this is why they must be created through an atomic move so that Spark
doesn't accidentally read a partially written file. I'd recommend looking
at "Basic Sources" in the Spark Streaming guide (
http://spark.apache.org/docs/latest/streaming-programming-guide.html ).
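
For example, a small sketch of the write-then-rename pattern on HDFS (the
paths are made up; the point is that the file only appears under the watched
directory via the final rename):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val staging = new Path("/tmp/staging/batch-" + System.currentTimeMillis)
val target  = new Path("/data/incoming/batch-" + System.currentTimeMillis)  // watched by textFileStream

// ... write the complete file contents to `staging` first ...
fs.rename(staging, target)   // atomic move: the stream sees a fully written file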

On Tue, Nov 4, 2014 at 10:41 AM, spr  wrote:

> I am trying to implement a use case that takes some human input.  Putting
> that in a single file (as opposed to a collection of HDFS files) would be a
> simpler human interface, so I tried an experiment with whether Spark
> Streaming (via textFileStream) will recognize a new version of a filename
> it
> has already digested.  (Yes, I'm deleting and moving a new file into the
> same name, not modifying in place.)  It appears the answer is No, it does
> not recognize a new version.  Can one of the experts confirm a) this is
> true
> and b) this is intended?
>
> Experiment:
> - run an existing program that works to digest new files in a directory
> - modify the data-creation script to put the new files always under the
> same
> name instead of different names, then run the script
>
> Outcome:  it sees the first file under that name, but none of the
> subsequent
> files (with different contents, which would show up in output).
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-appears-not-to-recognize-a-more-recent-version-of-an-already-seen-file-true-tp18074.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Cell : 425-233-8271


Spark Streaming appears not to recognize a more recent version of an already-seen file; true?

2014-11-04 Thread spr
I am trying to implement a use case that takes some human input.  Putting
that in a single file (as opposed to a collection of HDFS files) would be a
simpler human interface, so I tried an experiment with whether Spark
Streaming (via textFileStream) will recognize a new version of a filename it
has already digested.  (Yes, I'm deleting and moving a new file into the
same name, not modifying in place.)  It appears the answer is No, it does
not recognize a new version.  Can one of the experts confirm a) this is true
and b) this is intended?

Experiment:
- run an existing program that works to digest new files in a directory
- modify the data-creation script to put the new files always under the same
name instead of different names, then run the script

Outcome:  it sees the first file under that name, but none of the subsequent
files (with different contents, which would show up in output).



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-appears-not-to-recognize-a-more-recent-version-of-an-already-seen-file-true-tp18074.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Got java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s when running job from intellij Idea

2014-11-04 Thread Sean Owen
Hadoop is certainly bringing them back in. You should mark all Hadoop
and Spark deps as "provided" to not even build them into your app.
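
Concretely, that would look something like this in the build.sbt quoted below
(just a sketch, keeping the versions from the original):

libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-core"     % "1.1.0" % "provided",
  "org.apache.spark"   %% "spark-sql"      % "1.1.0" % "provided",
  "org.apache.spark"   %% "spark-mllib"    % "1.1.0" % "provided",
  "org.apache.hadoop"   % "hadoop-client"  % "2.4.0" % "provided",
  "org.apache.commons"  % "commons-math3"  % "3.3",
  "org.scalatest"       % "scalatest_2.10" % "2.2.0" % "test"
)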

On Tue, Nov 4, 2014 at 4:49 PM, Jaonary Rabarisoa  wrote:
> I don't understand why since there's no javax.servlet in my build.sbt :
>
>
> scalaVersion := "2.10.4"
>
> libraryDependencies ++= Seq(
>   "org.apache.spark" %% "spark-core" % "1.1.0",
>   "org.apache.spark" %% "spark-sql" % "1.1.0",
>   "org.apache.spark" %% "spark-mllib" % "1.1.0",
>   "org.apache.hadoop" % "hadoop-client" % "2.4.0",
>   "org.apache.commons" % "commons-math3" % "3.3",
>   "org.scalatest" % "scalatest_2.10" % "2.2.0" % "test"
> )
>
> resolvers += "Akka Repository" at "http://repo.akka.io/releases/";
>
>
> On Tue, Nov 4, 2014 at 11:00 AM, Sean Owen  wrote:
>>
>> Generally this means you included some javax.servlet dependency in
>> your project deps. You should exclude any of these as they conflict in
>> this bad way with other copies of the servlet API from Spark.
>>
>> On Tue, Nov 4, 2014 at 7:55 AM, Jaonary Rabarisoa 
>> wrote:
>> > Hi all,
>> >
>> > I have a spark job that I build with sbt and I can run without any
>> > problem
>> > with sbt run. But when I run it inside IntelliJ Idea I got the following
>> > error :
>> >
>> > Exception encountered when invoking run on a nested suite - class
>> > "javax.servlet.FilterRegistration"'s signer information does not match
>> > signer information of other classes in the same package
>> > java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s
>> > signer information does not match signer information of other classes in
>> > the
>> > same package
>> > at java.lang.ClassLoader.checkCerts(ClassLoader.java:952)
>> > at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666)
>> > at java.lang.ClassLoader.defineClass(ClassLoader.java:794)
>> > at
>> > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>> > at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>> > at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>> > at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>> > at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>> > at java.security.AccessController.doPrivileged(Native Method)
>> > at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>> > at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>> > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>> > at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>> > at
>> >
>> > org.eclipse.jetty.servlet.ServletContextHandler.(ServletContextHandler.java:136)
>> > at
>> >
>> > org.eclipse.jetty.servlet.ServletContextHandler.(ServletContextHandler.java:129)
>> > at
>> >
>> > org.eclipse.jetty.servlet.ServletContextHandler.(ServletContextHandler.java:98)
>> > at
>> > org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:98)
>> > at
>> > org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:89)
>> > at org.apache.spark.ui.WebUI.attachPage(WebUI.scala:67)
>> > at org.apache.spark.ui.WebUI$$anonfun$attachTab$1.apply(WebUI.scala:60)
>> > at org.apache.spark.ui.WebUI$$anonfun$attachTab$1.apply(WebUI.scala:60)
>> > at
>> >
>> > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> > at org.apache.spark.ui.WebUI.attachTab(WebUI.scala:60)
>> > at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:66)
>> > at org.apache.spark.ui.SparkUI.(SparkUI.scala:60)
>> > at org.apache.spark.ui.SparkUI.(SparkUI.scala:42)
>> > at org.apache.spark.SparkContext.(SparkContext.scala:223)
>> > at org.apache.spark.SparkContext.(SparkContext.scala:98)
>> >
>> >
>> > How can I solve this ?
>> >
>> >
>> > Cheers,
>> >
>> >
>> > Jao
>> >
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: java.io.NotSerializableException: org.apache.spark.SparkEnv

2014-11-04 Thread lordjoe
I posted on this issue in 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-access-objects-declared-and-initialized-outside-the-call-method-of-JavaRDD-td17094.html#a17150

Code starts
public class SparkUtilities extends Serializable 
private transient static ThreadLocal threadContext; 
private static String appName = "Anonymous"; 

Essentially you need to get a context on the slave machine, saving it in a
transient (non-serialized) field - at least that is what you want to do in
Java.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/java-io-NotSerializableException-org-apache-spark-SparkEnv-tp10641p18072.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Streaming: which code is (not) executed at every batch interval?

2014-11-04 Thread spr
The use case I'm working on has a main data stream in which a human needs to
modify what to look for.  I'm thinking to implement the main data stream
with Spark Streaming and the things to look for with Spark.
(Better approaches welcome.)

To do this, I have intermixed Spark and Spark Streaming code, and it appears
that the Spark code is not being executed every batch interval.  With
details elided, it looks like

val sc = new SparkContext(conf)
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint(".")

var lines = ssc.textFileStream(dirArg) //  
Spark Streaming code
var linesArray = lines.map( line => (line.split("\t")))

val whiteFd = (new java.io.File(whiteArg))  //  
non-Spark-Streaming code
if (whiteFd.lastModified > System.currentTimeMillis-(timeSliceArg*1000))
{
  // read the file into a var


  
//   Spark Streaming code
var SvrCum = newState.updateStateByKey[(Int, Time, Time)](updateMyState)

It appears the non-Spark-Streaming code gets executed once at program
initiation but not repeatedly. So, two questions:

1)  Is it correct that Spark code does not get executed per batch interval?

2)  Is there a definition somewhere of what code will and will not get
executed per batch interval?  (I didn't find it in either the Spark or Spark
Streaming programming guides.)





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-which-code-is-not-executed-at-every-batch-interval-tp18071.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SparkSQL - No support for subqueries in 1.2-snapshot?

2014-11-04 Thread Terry Siu
I’m trying to execute a subquery inside an IN clause and am encountering an 
unsupported language feature in the parser.


java.lang.RuntimeException: Unsupported language features in query: select 
customerid from sparkbug where customerid in (select customerid from sparkbug 
where customerid in (2,3))

TOK_QUERY

  TOK_FROM

TOK_TABREF

  TOK_TABNAME

sparkbug

  TOK_INSERT

TOK_DESTINATION

  TOK_DIR

TOK_TMP_FILE

TOK_SELECT

  TOK_SELEXPR

TOK_TABLE_OR_COL

  customerid

TOK_WHERE

  TOK_SUBQUERY_EXPR

TOK_SUBQUERY_OP

  in

TOK_QUERY

  TOK_FROM

TOK_TABREF

  TOK_TABNAME

sparkbug

  TOK_INSERT

TOK_DESTINATION

  TOK_DIR

TOK_TMP_FILE

TOK_SELECT

  TOK_SELEXPR

TOK_TABLE_OR_COL

  customerid

TOK_WHERE

  TOK_FUNCTION

in

TOK_TABLE_OR_COL

  customerid

2

3

TOK_TABLE_OR_COL

  customerid


scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
TOK_SUBQUERY_EXPR :

TOK_SUBQUERY_EXPR

  TOK_SUBQUERY_OP

in

  TOK_QUERY

TOK_FROM

  TOK_TABREF

TOK_TABNAME

  sparkbug

TOK_INSERT

  TOK_DESTINATION

TOK_DIR

  TOK_TMP_FILE

  TOK_SELECT

TOK_SELEXPR

  TOK_TABLE_OR_COL

customerid

  TOK_WHERE

TOK_FUNCTION

  in

  TOK_TABLE_OR_COL

customerid

  2

  3

  TOK_TABLE_OR_COL

customerid

" +



org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)



at scala.sys.package$.error(package.scala:27)

at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)

at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)

at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)

at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)

Are subqueries in predicates just not supported in 1.2? I think I’m seeing the 
same issue as:

http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html

Thanks,
-Terry





MEMORY_ONLY_SER question

2014-11-04 Thread Mohit Jaggi
Folks,
If I have an RDD persisted in MEMORY_ONLY_SER mode and then it is needed
for a transformation/action later, is the whole partition of the RDD
deserialized into Java objects first before my transform/action code works
on it? Or is it deserialized in a streaming manner as the iterator moves
over the partition? Is this behavior customizable? I generally use the Kryo
serializer.

Mohit.


Re: Got java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s when running job from intellij Idea

2014-11-04 Thread Jaonary Rabarisoa
I don't understand why since there's no javax.servlet in my build.sbt :


scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0",
  "org.apache.spark" %% "spark-sql" % "1.1.0",
  "org.apache.spark" %% "spark-mllib" % "1.1.0",
  "org.apache.hadoop" % "hadoop-client" % "2.4.0",
  "org.apache.commons" % "commons-math3" % "3.3",
  "org.scalatest" % "scalatest_2.10" % "2.2.0" % "test"
)

resolvers += "Akka Repository" at "http://repo.akka.io/releases/";


On Tue, Nov 4, 2014 at 11:00 AM, Sean Owen  wrote:

> Generally this means you included some javax.servlet dependency in
> your project deps. You should exclude any of these as they conflict in
> this bad way with other copies of the servlet API from Spark.
>
> On Tue, Nov 4, 2014 at 7:55 AM, Jaonary Rabarisoa 
> wrote:
> > Hi all,
> >
> > I have a spark job that I build with sbt and I can run without any
> problem
> > with sbt run. But when I run it inside IntelliJ Idea I got the following
> > error :
> >
> > Exception encountered when invoking run on a nested suite - class
> > "javax.servlet.FilterRegistration"'s signer information does not match
> > signer information of other classes in the same package
> > java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s
> > signer information does not match signer information of other classes in
> the
> > same package
> > at java.lang.ClassLoader.checkCerts(ClassLoader.java:952)
> > at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666)
> > at java.lang.ClassLoader.defineClass(ClassLoader.java:794)
> > at
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> > at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
> > at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
> > at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
> > at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> > at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> > at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> > at
> >
> org.eclipse.jetty.servlet.ServletContextHandler.(ServletContextHandler.java:136)
> > at
> >
> org.eclipse.jetty.servlet.ServletContextHandler.(ServletContextHandler.java:129)
> > at
> >
> org.eclipse.jetty.servlet.ServletContextHandler.(ServletContextHandler.java:98)
> > at
> org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:98)
> > at
> org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:89)
> > at org.apache.spark.ui.WebUI.attachPage(WebUI.scala:67)
> > at org.apache.spark.ui.WebUI$$anonfun$attachTab$1.apply(WebUI.scala:60)
> > at org.apache.spark.ui.WebUI$$anonfun$attachTab$1.apply(WebUI.scala:60)
> > at
> >
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> > at org.apache.spark.ui.WebUI.attachTab(WebUI.scala:60)
> > at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:66)
> > at org.apache.spark.ui.SparkUI.(SparkUI.scala:60)
> > at org.apache.spark.ui.SparkUI.(SparkUI.scala:42)
> > at org.apache.spark.SparkContext.(SparkContext.scala:223)
> > at org.apache.spark.SparkContext.(SparkContext.scala:98)
> >
> >
> > How can I solve this ?
> >
> >
> > Cheers,
> >
> >
> > Jao
> >
>


Re: with SparkStreeaming spark-submit, don't see output after ssc.start()

2014-11-04 Thread spr
Yes, good catch.  I also realized, after I posted, that I was calling 2
different classes, though they are in the same JAR.   I went back and tried
it again with the same class in both cases, and it failed the same way.  I
thought perhaps having 2 classes in a JAR was an issue, but commenting out
one of the classes did not seem to make a difference.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/with-SparkStreeaming-spark-submit-don-t-see-output-after-ssc-start-tp17989p18066.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Model characterization

2014-11-04 Thread Sameer Tilak
Excellent,  many thanks.  Really appreciate your help.


Sent via the Samsung GALAXY S®4, an AT&T 4G LTE smartphone

 Original message From: Xiangrui Meng 
 Date:11/03/2014  9:04 PM  (GMT-08:00) 
To: Sameer Tilak  Cc: 
user@spark.apache.org Subject: Re: Model characterization 

We recently added metrics for regression:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala
and you can use
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
for ROC if it is a binary classification problem.
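
For example, a minimal sketch (scoreAndLabels is assumed to be an
RDD[(Double, Double)] of (predicted score, true 0/1 label) pairs):

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
val rocCurve = metrics.roc().collect()   // (false positive rate, true positive rate) points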

 -Xiangrui

On Mon, Nov 3, 2014 at 12:52 PM, Sameer Tilak  wrote:
> Hi All,
>
> I have been using LinearRegression model of MLLib and very pleased with its
> scalability and robustness. Right now, we are just calculating MSE of our
> model. We would like to characterize the performance of our model. I was
> wondering adding support for computing things such as Confidence Interval
> etc. are  they something that are on the roadmap? Graphical things such as
> ROC curves etc. will that be supported by MLLib/other parts of the
> ecosystem? or is this something for which other statistical packages are
> recommended?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Streaming window operations not producing output

2014-11-04 Thread diogo
So, to answer my own n00b question, if case anyone ever needs it. You have
to enable checkpointing (by ssc.checkpoint(hdfsPath)). Windowed operations
need to be *checkpointed*, otherwise windows just won't work (and how could
they).
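
For anyone finding this later, a minimal sketch of the fix (the checkpoint
directory is made up; pairs is the (word, 1) DStream from the original post):

val ssc = new StreamingContext(conf, Seconds(2))
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")   // required for windowed/stateful ops

val wordCounts = pairs.reduceByKeyAndWindow(_ + _, _ - _, Seconds(4), Seconds(2))
wordCounts.print()

ssc.start()
ssc.awaitTermination()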

On Tue, Oct 28, 2014 at 10:24 AM, diogo  wrote:

> Hi there, I'm trying to use Window operations on streaming, but every time
> I perform a windowed computation, I stop getting results.
>
> For example:
>
> val wordCounts = pairs.reduceByKey(_ + _)
> wordCounts.print()
>
> Will print the output to the stdout on 'batch duration' interval. Now if I
> replace it with:
>
> val wordCounts = pairs.reduceByKeyAndWindow(_+_, _-_, Seconds(4),
> Seconds(2))
> wordCounts.print()
>
> It will never output. What did I get wrong?
>
> Thanks.
>


Re: Cleaning/transforming json befor converting to SchemaRDD

2014-11-04 Thread Yin Huai
Hi Daniel,

Right now, you need to do the transformation manually. The feature you need
is under development (https://issues.apache.org/jira/browse/SPARK-4190).
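
In the meantime, a sketch of what the manual transformation can look like with
json4s and a case class, reusing txt, sqx, and outpath from the code quoted
below (the field names here are invented; the real schema would differ):

import org.json4s._
import org.json4s.jackson.JsonMethods.parse

case class LogRecord(name: String, value: String)

import sqx.createSchemaRDD   // implicit conversion RDD[LogRecord] => SchemaRDD

val records = txt.mapPartitions { lines =>
  implicit val formats = DefaultFormats
  lines.map { line =>
    val j = parse(line)
    LogRecord((j \ "name").extract[String], (j \ "value").extract[String])
  }
}
records.saveAsParquetFile(outpath)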

Thanks,

Yin

On Tue, Nov 4, 2014 at 2:44 AM, Gerard Maas  wrote:

> You could transform the json to a case class instead of serializing it
> back to a String. The resulting RDD[MyCaseClass] is then directly usable as
> a SchemaRDD using the register function implicitly provided by 'import
> sqlContext.schemaRDD'. Then the rest of your pipeline will remain the same.
>
> -kr, Gerard
> On Nov 4, 2014 5:05 AM, "Daniel Mahler"  wrote:
>
>> I am trying to convert terabytes of json log files into parquet files.
>> but I need to clean it a little first.
>> I end up doing the following
>>
>> txt = sc.textFile(inpath).coalesce(800)
>>
>> val json = (for {
>>  line <- txt
>>  JObject(child) = parse(line)
>>  child2 = (for {
>>JField(name, value) <- child
>>_ <- patt(name) // filter fields with invalid names
>>  } yield JField(name.toLowerCase, value))
>> } yield compact(render(JObject(child2))))
>>
>> sqx.jsonRDD(json, 5e-2).saveAsParquetFile(outpath)
>>
>> One glaring inefficiency is that after parsing and cleaning the data I
>> reserialize it
>> by calling compact(render(JObject(child2)))) only to pass the text
>> to jsonRDD to be parsed again. However I see no way to turn an RDD of
>> json4s objects directly into a SchemaRDD without turning it back into text
>> first
>>
>> Is there any way to do this?
>>
>> I am also open to other suggestions for speeding up the above code,
>> it is very slow in its current form.
>>
>> I would also like to make jsonFile drop invalid json records rather than
>> failing the entire job. Is that possible?
>>
>> thanks
>> Daniel
>>
>>


Spark Streaming getOrCreate

2014-11-04 Thread sivarani
Hi All

I am using Spark Streaming:

public class SparkStreaming{
SparkConf sparkConf = new SparkConf().setAppName("Sales");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new
Duration(5000));
String chkPntDir = ""; //get checkpoint dir
jssc.checkpoint(chkPntDir);
JavaSpark jSpark = new JavaSpark(); //this is where i have the business
logic
JavaStreamingContext newJSC = jSpark.callTest(jssc);
newJSC.start();
newJSC.awaitTermination();
}

where

public class JavaSpark implements Serializable{
public JavaStreamingContext callTest(JavaStreamingContext){
logic goes here
}
}

This is working fine.

But when I try getOrCreate, as I want Spark Streaming to run 24/7:


JavaStreamingContextFactory contextFactory = new
JavaStreamingContextFactory() {
@Override
public JavaStreamingContext create() {
SparkConf sparkConf = new SparkConf().setAppName(appName).setMaster(master);
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new
Duration(5000));
jssc.checkpoint("checkpointDir");
JavaSpark js = new JavaSpark();
JavaStreamingContext newJssc = js.callTest(jssc);// This is where all the
logic is
return newJssc;
}
};

JavaStreamingContext context =
JavaStreamingContext.getOrCreate(checkPointDir, contextFactory);
context.start();
context.awaitTermination();

Not working

14/11/04 19:40:37 ERROR SparkDeploySchedulerBackend: Application has been
killed. Reason: All masters are unresponsive! Giving up.
Exception in thread "Thread-37" org.apache.spark.SparkException: Job aborted
due to stage failure: All masters are unresponsive! Giving up.
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/11/04 19:40:37 ERROR JobScheduler: Error running job streaming job
141511018 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: All
masters are unresponsive! Giving up.
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)


Please help me out.

Earlier the business logic was inside the ContextFactory, but I got

org.apache.spark.SparkException: Job aborted due to stage failure: Task not
serializable: java.io.NotSerializableException:
com.zoho.zbi.spark.PaymentStreaming$1

Then I added private static final long serialVersionUID =
-5751968749110204082L; in all the methods, but that didn't work either.

Got 

14/11/04 19:40:37 ERROR SparkDeploySchedulerBackend: Application has been
killed. Reason: All masters are unresponsive! Giving up.
Exception in thread "Thread-37" org.apache.spark.SparkException: Job aborted
due to stage failure: All masters are unresponsive! Giving up.
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apach

Re: Spark SQL takes unexpected time

2014-11-04 Thread Corey Nolet
Michael,

I should probably look closer myself @ the design of 1.2 vs 1.1 but I've
been curious why Spark's in-memory data uses the heap instead of putting it
off heap? Was this the optimization that was done in 1.2 to alleviate GC?

On Mon, Nov 3, 2014 at 8:52 PM, Shailesh Birari 
wrote:

> Yes, I am using Spark1.1.0 and have used rdd.registerTempTable().
> I tried by adding sqlContext.cacheTable(), but it took 59 seconds (more
> than
> earlier).
>
> I also tried by changing schema to use Long data type in some fields but
> seems conversion takes more time.
> Is there any way to specify index ?  Though I checked and didn't found any,
> just want to confirm.
>
> For your reference here is the snippet of code.
>
>
> -
> case class EventDataTbl(EventUID: Long,
> ONum: Long,
> RNum: Long,
> Timestamp: java.sql.Timestamp,
> Duration: String,
> Type: String,
> Source: String,
> OName: String,
> RName: String)
>
> val format = new java.text.SimpleDateFormat("-MM-dd
> hh:mm:ss")
> val cedFileName =
> "hdfs://hadoophost:8020/demo/poc/JoinCsv/output_2"
> val cedRdd = sc.textFile(cedFileName).map(_.split(",",
> -1)).map(p =>
> EventDataTbl(p(0).toLong, p(1).toLong, p(2).toLong, new
> java.sql.Timestamp(format.parse(p(3)).getTime()), p(4), p(5), p(6), p(7),
> p(8)))
>
> cedRdd.registerTempTable("EventDataTbl")
> sqlCntxt.cacheTable("EventDataTbl")
>
> val t1 = System.nanoTime()
> println("\n\n10 Most frequent conversations between the
> Originators and
> Recipients\n")
> sql("SELECT COUNT(*) AS Frequency,ONum,OName,RNum,RName
> FROM EventDataTbl
> GROUP BY ONum,OName,RNum,RName ORDER BY Frequency DESC LIMIT
> 10").collect().foreach(println)
> val t2 = System.nanoTime()
> println("Time taken " + (t2-t1)/10.0 + " Seconds")
>
>
> -
>
> Thanks,
>   Shailesh
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-takes-unexpected-time-tp17925p18017.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


stdout in spark applications

2014-11-04 Thread lokeshkumar
Hi Forum,

I am running a simple Spark application on 1 master and 1 worker, submitting
it through spark-submit as a Java program. I have System.out.println calls in
the program, but I am not finding their output under the stdout/stderr links
in the master's web UI, nor in the SPARK_HOME/work directory.

Please let me know if I have to change any settings.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/stdout-in-spark-applications-tp18056.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: netty on classpath when using spark-submit

2014-11-04 Thread M. Dale

Tobias,
   From http://spark.apache.org/docs/latest/configuration.html it seems 
that there is an experimental property:


spark.files.userClassPathFirst

Whether to give user-added jars precedence over Spark's own jars when 
loading classes in Executors. This feature can be used to mitigate 
conflicts between Spark's dependencies and user dependencies. It is 
currently an experimental feature.
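
If it helps, it can be passed straight on the command line (the class and jar
names below are placeholders):

spark-submit \
  --conf spark.files.userClassPathFirst=true \
  --class your.main.Class \
  your-assembly.jar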


HTH,
Markus

On 11/04/2014 01:50 AM, Tobias Pfeiffer wrote:

Hi,

I tried hard to get a version of netty into my jar file created with 
sbt assembly that works with all my libraries. Now I managed that and 
was really happy, but it seems like spark-submit puts an older version 
of netty on the classpath when submitting to a cluster, such that my 
code ends up with a NoSuchMethodError:


Code:
  val a = new DefaultHttpRequest(HttpVersion.HTTP_1_1, HttpMethod.POST,
"http://localhost";)
  val f = new File(a.getClass.getProtectionDomain().
getCodeSource().getLocation().getPath())
  println(f.getAbsolutePath)
  println("headers: " + a.headers())

When executed with "sbt run":
~/.ivy2/cache/io.netty/netty/bundles/netty-3.9.4.Final.jar
  headers: org.jboss.netty.handler.codec.http.DefaultHttpHeaders@64934069

When executed with "spark-submit":
~/spark-1.1.0-bin-hadoop2.4/lib/spark-assembly-1.1.0-hadoop2.4.0.jar
  Exception in thread "main" java.lang.NoSuchMethodError: 
org.jboss.netty.handler.codec.http.DefaultHttpRequest.headers()Lorg/jboss/netty/handler/codec/http/HttpHeaders;

...

How can I get the old netty version off my classpath?

Thanks
Tobias





Re: How to make sure a ClassPath is always shipped to workers?

2014-11-04 Thread Akhil Das
You can add your custom jar to the SPARK_CLASSPATH inside the spark-env.sh file
and restart the cluster to get it shipped to all the workers. You can also
use the .setJars option to add the jar while creating the SparkContext.
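
For example, a minimal sketch of the second option (master URL and jar path
are placeholders):

val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("spark://master:7077")
  .setJars(Seq("/path/to/my-custom.jar"))   // shipped to the executors for this app
val sc = new SparkContext(conf)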

Thanks
Best Regards

On Tue, Nov 4, 2014 at 8:12 AM, Peng Cheng  wrote:

> Sorry its a timeout duplicate, please remove it
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-sure-a-ClassPath-is-always-shipped-to-workers-tp18018p18020.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

