Date class not supported by SparkSQL

2015-04-19 Thread Lior Chaga
Using Spark 1.2.0. Tried to apply a schema to an RDD and register it, and got:
scala.MatchError: class java.util.Date (of class java.lang.Class)

I see it was resolved in https://issues.apache.org/jira/browse/SPARK-2562
(included in 1.2.0)

Anyone encountered this issue?

Thanks,
Lior


Re: Random pairs / RDD order

2015-04-19 Thread Aurélien Bellet

Hi Imran,

Thanks for the suggestion! Unfortunately the type does not match. But I 
could write my own function that shuffles the sample, though.
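
For what it's worth, a minimal sketch (not from the thread) of that idea, assuming
each partition's sample fits in memory: materialize the iterator into a collection
so that scala.util.Random.shuffle type-checks, then zip the two shuffled samples
partition by partition.

import scala.util.Random

// shuffle each sampled partition in memory, then pair up the two samples
val sample1 = rdd.sample(true, 0.01, 42)
  .mapPartitions(it => Random.shuffle(it.toVector).iterator)
val sample2 = rdd.sample(true, 0.01, 43)
  .mapPartitions(it => Random.shuffle(it.toVector).iterator)

val pairs = sample1.zipPartitions(sample2) { (s1, s2) => s1.zip(s2) }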


On 4/17/15 9:34 PM, Imran Rashid wrote:

if you can store the entire sample for one partition in memory, I think
you just want:

val sample1 =
rdd.sample(true,0.01,42).mapPartitions(scala.util.Random.shuffle)
val sample2 =
rdd.sample(true,0.01,43).mapPartitions(scala.util.Random.shuffle)

...



On Fri, Apr 17, 2015 at 3:05 AM, Aurélien Bellet
aurelien.bel...@telecom-paristech.fr wrote:

Hi Sean,

Thanks a lot for your reply. The problem is that I need to sample
random *independent* pairs. If I draw two samples and build all
n*(n-1) pairs then there is a lot of dependency. My current solution
is also not satisfying because some pairs (the closest ones in a
partition) have a much higher probability to be sampled. Not sure
how to fix this.

Aurelien


On 16/04/2015 20:44, Sean Owen wrote:

Use mapPartitions, and then take two random samples of the
elements in
the partition, and return an iterator over all pairs of them? Should
be pretty simple assuming your sample size n is smallish since
you're
returning ~n^2 pairs.

On Thu, Apr 16, 2015 at 7:00 PM, abellet
aurelien.bel...@telecom-paristech.fr wrote:

Hi everyone,

I have a large RDD and I am trying to create a RDD of a
random sample of
pairs of elements from this RDD. The elements composing a
pair should come
from the same partition for efficiency. The idea I've come
up with is to
take two random samples and then use zipPartitions to pair
each i-th element
of the first sample with the i-th element of the second
sample. Here is a
sample code illustrating the idea:

---
val rdd = sc.parallelize(1 to 6, 16)

val sample1 = rdd.sample(true,0.01,42)
val sample2 = rdd.sample(true,0.01,43)

def myfunc(s1: Iterator[Int], s2: Iterator[Int]):
Iterator[String] =
{
var res = List[String]()
while (s1.hasNext && s2.hasNext)
{
  val x = s1.next + " " + s2.next
  res ::= x
}
res.iterator
}

val pairs = sample1.zipPartitions(sample2)(myfunc)
-

However I am not happy with this solution because each element is most
likely to be paired with elements that are close by in the partition. This
is because sample returns an ordered Iterator.

Any idea how to fix this? I did not find a way to
efficiently shuffle the
random sample so far.

Thanks a lot!



--
View this message in context:

http://apache-spark-user-list.1001560.n3.nabble.com/Random-pairs-RDD-order-tp22529.html







Re: spark application was submitted twice unexpectedly

2015-04-19 Thread Pengcheng Liu
Looking into the work folder of the problematic application, it seems that the
application keeps creating executors, and the error log of the worker is as
below:
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException:
Unknown exception in doAs
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
at
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException:
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
... 4 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after
[30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:125)
at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
... 7 more

In the master's log, I also see errors generated continuously:
...
15/04/19 17:24:20 ERROR EndpointWriter: dropping message [class
org.apache.spark.deploy.DeployMessages$ExecutorUpdated] for non-local
recipient [Actor[akka.tcp://sparkDriver@spark101:46215/user/$b#-642262056]]
arriving at [akka.tcp://sparkDriver@spark101:46215] inbound addresses are
[akka.tcp://sparkMaster@spark101:7077]
15/04/19 17:24:20 ERROR EndpointWriter: dropping message [class
org.apache.spark.deploy.DeployMessages$ExecutorAdded] for non-local
recipient [Actor[akka.tcp://sparkDriver@spark101:46215/user/$b#-642262056]]
arriving at [akka.tcp://sparkDriver@spark101:46215] inbound addresses are
[akka.tcp://sparkMaster@spark101:7077]
15/04/19 17:24:20 ERROR EndpointWriter: dropping message [class
org.apache.spark.deploy.DeployMessages$ExecutorUpdated] for non-local
recipient [Actor[akka.tcp://sparkDriver@spark101:46215/user/$b#-642262056]]
arriving at [akka.tcp://sparkDriver@spark101:46215] inbound addresses are
[akka.tcp://sparkMaster@spark101:7077]
15/04/19 17:24:21 ERROR EndpointWriter: dropping message [class
org.apache.spark.deploy.DeployMessages$ExecutorUpdated] for non-local
recipient [Actor[akka.tcp://sparkDriver@spark101:46215/user/$b#-642262056]]
arriving at [akka.tcp://sparkDriver@spark101:46215] inbound addresses are
[akka.tcp://sparkMaster@spark101:7077]
15/04/19 17:24:21 ERROR UserGroupInformation: PriviledgedActionException
as:root cause:java.util.concurrent.TimeoutException: Futures timed out after
[30 seconds]
15/04/19 17:24:22 ERROR EndpointWriter: dropping message [class
org.apache.spark.deploy.DeployMessages$ExecutorUpdated] for non-local
recipient [Actor[akka.tcp://sparkDriver@spark101:53140/user/$b#-510580371]]
arriving at [akka.tcp://sparkDriver@spark101:53140] inbound addresses are
[akka.tcp://sparkMaster@spark101:7077]
15/04/19 17:24:22 ERROR EndpointWriter: dropping message [class
org.apache.spark.deploy.DeployMessages$ExecutorAdded] for non-local
recipient [Actor[akka.tcp://sparkDriver@spark101:53140/user/$b#-510580371]]
arriving at [akka.tcp://sparkDriver@spark101:53140] inbound addresses are
[akka.tcp://sparkMaster@spark101:7077]
...



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-application-was-submitted-twice-unexpectedly-tp22551p22560.html



Re: Skipped Jobs

2015-04-19 Thread Mark Hamstra
Almost.  Jobs don't get skipped.  Stages and Tasks do if the needed results
are already available.
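
For illustration, a minimal sketch (not from the thread) that typically shows skipped
stages in the UI: the second action reuses the shuffle output of the first, so the
map-side stage is marked as skipped. The input path is hypothetical.

val counts = sc.textFile("hdfs:///some/input")   // hypothetical path
  .flatMap(_.split(" "))
  .map(w => (w, 1))
  .reduceByKey(_ + _)

counts.count()  // first job: runs both the shuffle map stage and the result stage
counts.count()  // second job: the shuffle output already exists, so the map stage shows as skipped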

On Sun, Apr 19, 2015 at 3:18 PM, Denny Lee denny.g@gmail.com wrote:

 The job is skipped because the results are available in memory from a
 prior run.  More info at:
 http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3ccakx7bf-u+jc6q_zm7gtsj1mihagd_4up4qxpd9jfdjrfjax...@mail.gmail.com%3E.
 HTH!

 On Sun, Apr 19, 2015 at 1:43 PM James King jakwebin...@gmail.com wrote:

 In the web UI I can see some jobs as 'skipped'. What does that mean? Why
 are these jobs skipped? Do they ever get executed?

 Regards
 jk




Re: Skipped Jobs

2015-04-19 Thread Denny Lee
Thanks for the correction Mark :)

On Sun, Apr 19, 2015 at 3:45 PM Mark Hamstra m...@clearstorydata.com
wrote:

 Almost.  Jobs don't get skipped.  Stages and Tasks do if the needed
 results are already available.

 On Sun, Apr 19, 2015 at 3:18 PM, Denny Lee denny.g@gmail.com wrote:

 The job is skipped because the results are available in memory from a
 prior run.  More info at:
 http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3ccakx7bf-u+jc6q_zm7gtsj1mihagd_4up4qxpd9jfdjrfjax...@mail.gmail.com%3E.
 HTH!

 On Sun, Apr 19, 2015 at 1:43 PM James King jakwebin...@gmail.com wrote:

 In the web UI I can see some jobs as 'skipped'. What does that mean? Why
 are these jobs skipped? Do they ever get executed?

 Regards
 jk





Re: newAPIHadoopRDD file name

2015-04-19 Thread hnahak
At the record reader level you can pass the file name as the key or the value.

 sc.newAPIHadoopRDD(job.getConfiguration,
   classOf[AvroKeyInputFormat[myObject]],
   classOf[AvroKey[myObject]],
   classOf[Text])   // the Text value can contain your file name

AvroKeyInputFormat extends InputFormat<AvroKey[myObject], Text>
{
   createRecordReader() { return new YourRecordReader() }
}

YourRecordReader extends RecordReader<AvroKey[myObject], Text> {

  initialize() {
    Path file = inputSplit.getPath(); // you can pass this file name as the value
                                      // from your record reader
  }

}



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/newAPIHadoopRDD-file-name-tp22556p22567.html



Skipped Jobs

2015-04-19 Thread James King
In the web UI I can see some jobs as 'skipped'. What does that mean? Why are
these jobs skipped? Do they ever get executed?

Regards
jk


RE: Can a map function return null

2015-04-19 Thread Evo Eftimov
In fact you can return “NULL” from your initial map and hence not resort to 
Optional<String> at all 

 

From: Evo Eftimov [mailto:evo.efti...@isecc.com] 
Sent: Sunday, April 19, 2015 9:48 PM
To: 'Steve Lewis'
Cc: 'Olivier Girardot'; 'user@spark.apache.org'
Subject: RE: Can a map function return null

 

Well you can do another map to turn Optional<String> into String; in the 
cases when Optional is empty you can store e.g. “NULL” as the value of the RDD 
element 

 

If this is not acceptable (based on the objectives of your architecture) and IF 
returning plain null instead of Optional does throw a Spark exception, THEN 
as far as I am concerned, checkmate 

 

From: Steve Lewis [mailto:lordjoe2...@gmail.com] 
Sent: Sunday, April 19, 2015 8:16 PM
To: Evo Eftimov
Cc: Olivier Girardot; user@spark.apache.org
Subject: Re: Can a map function return null

 

 

So you imagine something like this:

 

 JavaRDD<String> words = ...

 JavaRDD<Optional<String>> wordsFiltered = words.map(new Function<String, Optional<String>>() {
     @Override
     public Optional<String> call(String s) throws Exception {
         if ((s.length()) % 2 == 1) // drop strings of odd length
             return Optional.empty();
         else
             return Optional.of(s);
     }
 });

That seems to return the wrong type, a JavaRDD<Optional<String>>, which cannot 
be used as the JavaRDD<String> that the next step expects

 

On Sun, Apr 19, 2015 at 12:17 PM, Evo Eftimov evo.efti...@isecc.com wrote:

I am on the move at the moment so I can't try it immediately, but from previous 
memory/experience I think if you return plain null you will get a Spark 
exception

 

Anyway you can try it and see what happens, and then ask the question 

 

If you do get exception try Optional instead of plain null

 

 

Sent from Samsung Mobile

 

 Original message 

From: Olivier Girardot 

Date:2015/04/18 22:04 (GMT+00:00) 

To: Steve Lewis ,user@spark.apache.org 

Subject: Re: Can a map function return null 

 

You can return an RDD with null values inside, and afterwards filter on item 
!= null 
In scala (or even in Java 8) you'd rather use Option/Optional, and in Scala 
they're directly usable from Spark. 

Example: 

 sc.parallelize(1 to 1000).flatMap(item => if (item % 2 == 0) Some(item) else 
None).collect()

res0: Array[Int] = Array(2, 4, 6, )

Regards, 

Olivier.

 

On Sat, 18 Apr 2015 at 20:44, Steve Lewis lordjoe2...@gmail.com wrote:

I find a number of cases where I have a JavaRDD and I wish to transform the 
data and depending on a test return 0 or one item (don't suggest a filter - the 
real case is more complex). So I currently do something like the following - 
perform a flatmap returning a list with 0 or 1 entry depending on the isUsed 
function.


 

 JavaRDD<Foo> original = ...

  JavaRDD<Foo> words = original.flatMap(new FlatMapFunction<Foo, Foo>() {
      @Override
      public Iterable<Foo> call(final Foo s) throws Exception {
          List<Foo> ret = new ArrayList<Foo>();
          if (isUsed(s))
              ret.add(transform(s));
          return ret; // contains 0 items if isUsed is false
      }
  });

 

My question is: can I do a map returning the transformed data, and null if 
nothing is to be returned, as shown below? What does Spark do with a map 
function returning null?

 

JavaRDD<Foo> words = original.map(new MapFunction<String, String>() {
    @Override
    Foo call(final Foo s) throws Exception {
        List<Foo> ret = new ArrayList<Foo>();
        if (isUsed(s))
            return transform(s);
        return null; // not used - what happens now
    }
});

 

 

 





 

-- 

Steven M. Lewis PhD

4221 105th Ave NE

Kirkland, WA 98033

206-384-1340 (cell)
Skype lordjoe_com



GraphX: unbalanced computation and slow runtime on livejournal network

2015-04-19 Thread harenbergsd
Hi all,

I have been testing GraphX on the soc-LiveJournal1 network from the SNAP
repository. Currently I am running on c3.8xlarge EC2 instances on Amazon.
These instances have 32 cores and 60GB RAM per node, and so far I have run
SSSP, PageRank, and WCC on a 1, 4, and 8 node cluster.

The issues I am having, which are present for all three algorithms, are that
(1) GraphX is not improving between 4 and 8 nodes and (2) GraphX seems to be
heavily unbalanced with some machines doing the majority of the computation.

PageRank (20 iterations) is the worst. For 1-node, 4-node, and 8-node
clusters I get the following runtimes (wallclock): 192s, 154s, and 154s.
This result is potentially understandable, though the times are
significantly worse than the results in the paper
https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx.pdf, where
this algorithm ran in ~75s on a worse cluster.

My main concern is that the computation seems to be heavily unbalanced. I
have measured the CPU time of all the processes associated with GraphX during
its execution and for a 4-node cluster it yielded the following CPU times
(for each machine): 724s, 697s, 2216s, 694s.

Is this normal? Should I expect a more even distribution of work across
machines?

I am using the stock pagerank code found here:
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala.
I use the configurations spark.executor.memory=40g and
spark.cores.max=128 for the 4-node case. I also set the number of edge
partitions to be 64.

Could you please let me know if these results are reasonable, or if I am
doing something wrong. I really appreciate the help.

Thanks,
Steve



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-unbalanced-computation-and-slow-runtime-on-livejournal-network-tp22565.html



RE: Can a map function return null

2015-04-19 Thread Evo Eftimov
Well you can do another map to turn Optional<String> into String; in the 
cases when Optional is empty you can store e.g. “NULL” as the value of the RDD 
element 

 

If this is not acceptable (based on the objectives of your architecture) and IF 
returning plain null instead of Optional does throw a Spark exception, THEN 
as far as I am concerned, checkmate 

 

From: Steve Lewis [mailto:lordjoe2...@gmail.com] 
Sent: Sunday, April 19, 2015 8:16 PM
To: Evo Eftimov
Cc: Olivier Girardot; user@spark.apache.org
Subject: Re: Can a map function return null

 

 

So you imagine something like this:

 

 JavaRDD<String> words = ...

 JavaRDD<Optional<String>> wordsFiltered = words.map(new Function<String, Optional<String>>() {
     @Override
     public Optional<String> call(String s) throws Exception {
         if ((s.length()) % 2 == 1) // drop strings of odd length
             return Optional.empty();
         else
             return Optional.of(s);
     }
 });

That seems to return the wrong type, a JavaRDD<Optional<String>>, which cannot 
be used as the JavaRDD<String> that the next step expects

 

On Sun, Apr 19, 2015 at 12:17 PM, Evo Eftimov evo.efti...@isecc.com wrote:

I am on the move at the moment so I can't try it immediately, but from previous 
memory/experience I think if you return plain null you will get a Spark 
exception

 

Anyway you can try it and see what happens, and then ask the question 

 

If you do get exception try Optional instead of plain null

 

 

Sent from Samsung Mobile

 

 Original message 

From: Olivier Girardot 

Date:2015/04/18 22:04 (GMT+00:00) 

To: Steve Lewis ,user@spark.apache.org 

Subject: Re: Can a map function return null 

 

You can return an RDD with null values inside, and afterwards filter on item 
!= null 
In scala (or even in Java 8) you'd rather use Option/Optional, and in Scala 
they're directly usable from Spark. 

Example: 

 sc.parallelize(1 to 1000).flatMap(item => if (item % 2 == 0) Some(item) else 
None).collect()

res0: Array[Int] = Array(2, 4, 6, )

Regards, 

Olivier.

 

On Sat, 18 Apr 2015 at 20:44, Steve Lewis lordjoe2...@gmail.com wrote:

I find a number of cases where I have a JavaRDD and I wish to transform the 
data and depending on a test return 0 or one item (don't suggest a filter - the 
real case is more complex). So I currently do something like the following - 
perform a flatmap returning a list with 0 or 1 entry depending on the isUsed 
function.


 

 JavaRDD<Foo> original = ...

  JavaRDD<Foo> words = original.flatMap(new FlatMapFunction<Foo, Foo>() {
      @Override
      public Iterable<Foo> call(final Foo s) throws Exception {
          List<Foo> ret = new ArrayList<Foo>();
          if (isUsed(s))
              ret.add(transform(s));
          return ret; // contains 0 items if isUsed is false
      }
  });

 

My question is: can I do a map returning the transformed data, and null if 
nothing is to be returned, as shown below? What does Spark do with a map 
function returning null?

 

JavaRDD<Foo> words = original.map(new MapFunction<String, String>() {
    @Override
    Foo call(final Foo s) throws Exception {
        List<Foo> ret = new ArrayList<Foo>();
        if (isUsed(s))
            return transform(s);
        return null; // not used - what happens now
    }
});

 

 

 





 

-- 

Steven M. Lewis PhD

4221 105th Ave NE

Kirkland, WA 98033

206-384-1340 (cell)
Skype lordjoe_com



Re: GraphX: unbalanced computation and slow runtime on livejournal network

2015-04-19 Thread hnahak
Hi Steve

I did Spark 1.3.0 PageRank benchmarking on soc-LiveJournal1 in a 4-node
cluster with 16, 16, 8, and 8 GB RAM respectively. The cluster has 4 workers,
including the master, with 4, 4, 2, and 2 CPUs.
I set executor memory to 3g and the driver to 5g.

No. of Iterations   -- GraphX(mins)
1   -- 1
2   -- 1.2
3   -- 1.3
5   -- 1.6
10  -- 2.6
20  -- 3.9
   



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-unbalanced-computation-and-slow-runtime-on-livejournal-network-tp22565p22566.html



Re: MLlib -Collaborative Filtering

2015-04-19 Thread Christian S. Perone
The easiest way to do that is to use a similarity metric between the
different user factors.
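
For example, a minimal sketch (not from the thread), assuming an ALS model has
already been trained with MLlib; userId1 and userId2 are placeholder ids:

// model: org.apache.spark.mllib.recommendation.MatrixFactorizationModel
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  dot / (math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum))
}

// userFeatures is an RDD[(Int, Array[Double])] of latent factors per user
val u1 = model.userFeatures.lookup(userId1).head
val u2 = model.userFeatures.lookup(userId2).head
val similarity = cosine(u1, u2)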

On Sat, Apr 18, 2015 at 7:49 AM, riginos samarasrigi...@gmail.com wrote:

 Is there any way that I can see the similarity table of 2 users in that
 algorithm? By that I mean the similarity between 2 users.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-Collaborative-Filtering-tp22553.html




-- 
Blog http://blog.christianperone.com | Github https://github.com/perone
| Twitter https://twitter.com/tarantulae
Forgive, O Lord, my little jokes on Thee, and I'll forgive Thy great big
joke on me.


Aggregation by column and generating a json

2015-04-19 Thread dsub
I am exploring Spark SQL and DataFrames and trying to create an aggregation
by column and generate a single JSON row with the aggregation. Any inputs on the
right approach will be helpful.

Here is my sample data
user,sports,major,league,count

[test1,Sports,Switzerland,NLA,6]
[test1,Football,Australia,A-League,6]
[test1,Ice Hockey,Sweden,SHL,3]
[test1,Ice Hockey,Switzerland,NLB,2]
[test1,Football,Romania,Liga I,1]

I want to aggregate by user and create a single json row. 

{ "user": "test1", "sports": [ { "Ice Hockey": 11, "Football": 7 } ],
"major": [ { "Switzerland": 8, "Australia": 6, "Sweden": 3, "Romania": 1 } ],
"league": [ { "NLA": 6, "A-League": 6, "SHL": 3, "NLB": 2, "Liga I": 1 } ],
"total": 18 }



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Aggregation-by-column-and-generating-a-json-tp22562.html



Re: Can a map function return null

2015-04-19 Thread Steve Lewis
So you imagine something like this:

 JavaRDD<String> words = ...

 JavaRDD<Optional<String>> wordsFiltered = words.map(new Function<String, Optional<String>>() {
     @Override
     public Optional<String> call(String s) throws Exception {
         if ((s.length()) % 2 == 1) // drop strings of odd length
             return Optional.empty();
         else
             return Optional.of(s);
     }
 });


That seems to return the wrong type, a JavaRDD<Optional<String>>,
which cannot be used as the JavaRDD<String> that the next step
expects


On Sun, Apr 19, 2015 at 12:17 PM, Evo Eftimov evo.efti...@isecc.com wrote:

 I am on the move at the moment so I can't try it immediately, but from
 previous memory/experience I think if you return plain null you will get
 a Spark exception

 Anyway you can try it and see what happens, and then ask the question

 If you do get exception try Optional instead of plain null


 Sent from Samsung Mobile


  Original message 
 From: Olivier Girardot
 Date:2015/04/18 22:04 (GMT+00:00)
 To: Steve Lewis ,user@spark.apache.org
 Subject: Re: Can a map function return null

 You can return an RDD with null values inside, and afterwards filter on
 item != null
 In scala (or even in Java 8) you'd rather use Option/Optional, and in
 Scala they're directly usable from Spark.
 Example:

  sc.parallelize(1 to 1000).flatMap(item => if (item % 2 == 0) Some(item)
 else None).collect()

 res0: Array[Int] = Array(2, 4, 6, )

 Regards,

 Olivier.

 On Sat, 18 Apr 2015 at 20:44, Steve Lewis lordjoe2...@gmail.com
 wrote:

 I find a number of cases where I have a JavaRDD and I wish to transform
 the data and depending on a test return 0 or one item (don't suggest a
 filter - the real case is more complex). So I currently do something like
 the following - perform a flatmap returning a list with 0 or 1 entry
 depending on the isUsed function.

  JavaRDD<Foo> original = ...
  JavaRDD<Foo> words = original.flatMap(new FlatMapFunction<Foo, Foo>() {
      @Override
      public Iterable<Foo> call(final Foo s) throws Exception {
          List<Foo> ret = new ArrayList<Foo>();
          if (isUsed(s))
              ret.add(transform(s));
          return ret; // contains 0 items if isUsed is false
      }
  });

 My question is: can I do a map returning the transformed data, and null if
 nothing is to be returned, as shown below? What does Spark do with a map
 function returning null?

  JavaRDD<Foo> words = original.map(new MapFunction<String, String>() {
      @Override
      Foo call(final Foo s) throws Exception {
          List<Foo> ret = new ArrayList<Foo>();
          if (isUsed(s))
              return transform(s);
          return null; // not used - what happens now
      }
  });






-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com


GraphX: unbalanced computation and slow runtime on livejournal network

2015-04-19 Thread Steven Harenberg
Hi all,

I have been testing GraphX on the soc-LiveJournal1 network from the SNAP
repository. Currently I am running on c3.8xlarge EC2 instances on Amazon.
These instances have 32 cores and 60GB RAM per node, and so far I have run
SSSP, PageRank, and WCC on a 1, 4, and 8 node cluster.

The issues I am having, which are present for all three algorithms, are that
(1) GraphX is not improving between 4 and 8 nodes and (2) GraphX seems to
be heavily unbalanced with some machines doing the majority of the
computation.

PageRank (20 iterations) is the worst. For 1-node, 4-node, and 8-node
clusters I get the following runtimes (wallclock): 192s, 154s, and 154s.
These results are potentially understandable, though the times are
significantly worse than the results in the paper
https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx.pdf, where
this algorithm ran in ~75s on a worse cluster.

My main concern is that the computation seems to be heavily unbalanced. I
have measured the CPU time of all the process associated with GraphX during
its execution and for a 4-node cluster it yielded the following CPU times
(for each machine): 724s, 697s, 2216s, 694s.

Is this normal? Should I expect a more even distribution of work across
machines?

I am using the stock pagerank code found here:
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala.
I use the configurations spark.executor.memory=40g and
spark.cores.max=128 for the 4-node case. I also set the number of edge
partitions to be 64.
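
For reference, a minimal sketch (not part of the original message) of how such a
run can be wired up; the input path is hypothetical, and EdgePartition2D is one
partition strategy that often spreads edge work more evenly on power-law graphs:

import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/soc-LiveJournal1.txt",  // hypothetical path
    numEdgePartitions = 64)
  .partitionBy(PartitionStrategy.EdgePartition2D)

val ranks = graph.staticPageRank(numIter = 20).vertices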

Could you please let me know if these results are reasonable or if there is
a better way to ensure the computation is better distributed among the
nodes in a cluster. I really appreciate the help.

Thanks,
Steve


compilation error

2015-04-19 Thread Brahma Reddy Battula
Hi All

Getting the following error when I am compiling Spark. What did I miss? Even 
googled and did not find the exact solution for this...


[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on project 
spark-assembly_2.10: Error creating shaded jar: Error in ASM processing class 
com/ibm/icu/impl/data/LocaleElements_zh__PINYIN.class: 48188 - [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal 
org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on project 
spark-assembly_2.10: Error creating shaded jar: Error in ASM processing class 
com/ibm/icu/impl/data/LocaleElements_zh__PINYIN.class
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)




Thanks & Regards

Brahma Reddy Battula





Re: compilation error

2015-04-19 Thread Ted Yu
What JDK release are you using ?

Can you give the complete command you used ?

Which Spark branch are you working with ?

Cheers

On Sun, Apr 19, 2015 at 7:25 PM, Brahma Reddy Battula 
brahmareddy.batt...@huawei.com wrote:

  Hi All

 Getting following error, when I am compiling spark..What did I miss..?
 Even googled and did not find the exact solution for this...


 [ERROR] Failed to execute goal
 org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on project
 spark-assembly_2.10: Error creating shaded jar: Error in ASM processing
 class com/ibm/icu/impl/data/LocaleElements_zh__PINYIN.class: 48188 - [Help
 1]
 org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
 goal org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on
 project spark-assembly_2.10: Error creating shaded jar: Error in ASM
 processing class com/ibm/icu/impl/data/LocaleElements_zh__PINYIN.class
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
 at
 org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
 at
 org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
 at
 org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
 at
 org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
 at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)



  Thanks & Regards

 Brahma Reddy Battula






Re: dataframe can not find fields after loading from hive

2015-04-19 Thread Yin Huai
Hi Cesar,

Can you try 1.3.1 (
https://spark.apache.org/releases/spark-release-1-3-1.html) and see if it
still shows the error?

Thanks,

Yin

On Fri, Apr 17, 2015 at 1:58 PM, Reynold Xin r...@databricks.com wrote:

 This is strange. cc the dev list since it might be a bug.



 On Thu, Apr 16, 2015 at 3:18 PM, Cesar Flores ces...@gmail.com wrote:

 Never mind. I found the solution:

 val newDataFrame = hc.createDataFrame(hiveLoadedDataFrame.rdd,
 hiveLoadedDataFrame.schema)

 which translate to convert the data frame to rdd and back again to data
 frame. Not the prettiest solution, but at least it solves my problems.


 Thanks,
 Cesar Flores



 On Thu, Apr 16, 2015 at 11:17 AM, Cesar Flores ces...@gmail.com wrote:


 I have a data frame in which I load data from a hive table. And my issue
 is that the data frame is missing the columns that I need to query.

 For example:

 val newdataset = dataset.where(dataset("label") === 1)

 gives me an error like the following:

 ERROR yarn.ApplicationMaster: User class threw exception: resolved
 attributes label missing from label, user_id, ...(the rest of the fields of
 my table
 org.apache.spark.sql.AnalysisException: resolved attributes label
 missing from label, user_id, ... (the rest of the fields of my table)

 where we can see that the label field actually exist. I manage to solve
 this issue by updating my syntax to:

 val newdataset = dataset.where($"label" === 1)

 which works. However I can not make this trick in all my queries. For
 example, when I try to do a unionAll from two subsets of the same data
 frame the error I am getting is that all my fields are missing.

 Can someone tell me if I need to do some post processing after loading
 from hive in order to avoid this kind of errors?


 Thanks
 --
 Cesar Flores




 --
 Cesar Flores





Re: Can't get SparkListener to work

2015-04-19 Thread Shixiong Zhu
The problem is the code you use to test:

sc.parallelize(List(1, 2, 3)).map(throw new
SparkException("test")).collect();

is like the following example:

def foo: Int => Nothing = {
  throw new SparkException("test")
}
sc.parallelize(List(1, 2, 3)).map(foo).collect();

So the Spark job is actually never submitted, since it fails in `foo`, which
is used to create the map function.

Change it to

sc.parallelize(List(1, 2, 3)).map(i => throw new
SparkException("test")).collect();

And you will see the correct messages from your listener.
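
As a side note, for the original goal (getting callbacks for exceptions thrown in
transformations on executors), a minimal sketch (not from the thread) that matches
on the task end reason:

import org.apache.spark.ExceptionFailure
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

sc.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    taskEnd.reason match {
      case ef: ExceptionFailure =>
        // executor-side exception, e.g. thrown inside a map
        println("task failed: " + ef.className + ": " + ef.description)
      case _ => // succeeded or killed; nothing to report
    }
})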



Best Regards,
Shixiong(Ryan) Zhu

2015-04-19 1:06 GMT+08:00 Praveen Balaji secondorderpolynom...@gmail.com:

 Thanks for the response, Archit. I get callbacks when I do not throw an
 exception from map.
 My use case, however, is to get callbacks for exceptions in
 transformations on executors. Do you think I'm going down the right route?

 Cheers
 -p

 On Sat, Apr 18, 2015 at 1:49 AM, Archit Thakur archit279tha...@gmail.com
 wrote:

 Hi Praveen,
 Can you try once removing the throw exception in the map? Do you still not get
 it?
 On Apr 18, 2015 8:14 AM, Praveen Balaji 
 secondorderpolynom...@gmail.com wrote:

 Thanks for the response, Imran. I probably chose the wrong methods for
 this email. I implemented all methods of SparkListener and the only
 callback I get is onExecutorMetricsUpdate.

 Here's the complete code:

 ==

 import org.apache.spark.scheduler._

 sc.addSparkListener(new SparkListener() {
   override def onStageCompleted(e: SparkListenerStageCompleted) = println(" onStageCompleted");
   override def onStageSubmitted(e: SparkListenerStageSubmitted) = println(" onStageSubmitted");
   override def onTaskStart(e: SparkListenerTaskStart) = println(" onTaskStart");
   override def onTaskGettingResult(e: SparkListenerTaskGettingResult) = println(" onTaskGettingResult");
   override def onTaskEnd(e: SparkListenerTaskEnd) = println(" onTaskEnd");
   override def onJobStart(e: SparkListenerJobStart) = println(" onJobStart");
   override def onJobEnd(e: SparkListenerJobEnd) = println(" onJobEnd");
   override def onEnvironmentUpdate(e: SparkListenerEnvironmentUpdate) = println(" onEnvironmentUpdate");
   override def onBlockManagerAdded(e: SparkListenerBlockManagerAdded) = println(" onBlockManagerAdded");
   override def onBlockManagerRemoved(e: SparkListenerBlockManagerRemoved) = println(" onBlockManagerRemoved");
   override def onUnpersistRDD(e: SparkListenerUnpersistRDD) = println(" onUnpersistRDD");
   override def onApplicationStart(e: SparkListenerApplicationStart) = println(" onApplicationStart");
   override def onApplicationEnd(e: SparkListenerApplicationEnd) = println(" onApplicationEnd");
   override def onExecutorMetricsUpdate(e: SparkListenerExecutorMetricsUpdate) = println(" onExecutorMetricsUpdate");
 });

 sc.parallelize(List(1, 2, 3)).map(throw new SparkException("test")).collect();

 =

 On Fri, Apr 17, 2015 at 4:13 PM, Imran Rashid iras...@cloudera.com
 wrote:

 when you start the spark-shell, it's already too late to get the
 ApplicationStart event.  Try listening for StageCompleted or JobEnd 
 instead.

 On Fri, Apr 17, 2015 at 5:54 PM, Praveen Balaji 
 secondorderpolynom...@gmail.com wrote:

 I'm trying to create a simple SparkListener to get notified of errors
 on executors. I do not get any callbacks on my SparkListener. Here is some
 simple code I'm executing in spark-shell. But I still don't get any
 callbacks on my listener. Am I doing something wrong?

 Thanks for any clue you can send my way.

 Cheers
 Praveen

 ==
 import org.apache.spark.scheduler.SparkListener
 import org.apache.spark.scheduler.SparkListenerApplicationStart
 import org.apache.spark.scheduler.SparkListenerApplicationEnd
 import org.apache.spark.SparkException

 sc.addSparkListener(new SparkListener() {
   override def onApplicationStart(applicationStart: SparkListenerApplicationStart) {
     println(" onApplicationStart: " + applicationStart.appName);
   }

   override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd) {
     println(" onApplicationEnd: " + applicationEnd.time);
   }
 });

 sc.parallelize(List(1, 2, 3)).map(throw new SparkException("test")).collect();
 ===

 output:

 scala> org.apache.spark.SparkException: hshsh
 at $iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
 at $iwC$$iwC$$iwC.<init>(<console>:34)
 at $iwC$$iwC.<init>(<console>:36)
 at $iwC.<init>(<console>:38)







Re: Can't get SparkListener to work

2015-04-19 Thread Praveen Balaji
Thanks Shixiong. I'll try this.

On Sun, Apr 19, 2015, 7:36 PM Shixiong Zhu zsxw...@gmail.com wrote:

 The problem is the code you use to test:


 sc.parallelize(List(1, 2, 3)).map(throw new
 SparkException("test")).collect();

 is like the following example:

 def foo: Int => Nothing = {
   throw new SparkException("test")
 }
 sc.parallelize(List(1, 2, 3)).map(foo).collect();

 So the Spark job is actually never submitted, since it fails in `foo`,
 which is used to create the map function.

 Change it to

 sc.parallelize(List(1, 2, 3)).map(i => throw new
 SparkException("test")).collect();

 And you will see the correct messages from your listener.



 Best Regards,
 Shixiong(Ryan) Zhu

 2015-04-19 1:06 GMT+08:00 Praveen Balaji secondorderpolynom...@gmail.com
 :

 Thanks for the response, Archit. I get callbacks when I do not throw an
 exception from map.
 My use case, however, is to get callbacks for exceptions in
 transformations on executors. Do you think I'm going down the right route?

 Cheers
 -p

 On Sat, Apr 18, 2015 at 1:49 AM, Archit Thakur archit279tha...@gmail.com
  wrote:

 Hi Praveen,
 Can you try once removing the throw exception in the map? Do you still not get
 it?
 On Apr 18, 2015 8:14 AM, Praveen Balaji 
 secondorderpolynom...@gmail.com wrote:

 Thanks for the response, Imran. I probably chose the wrong methods for
 this email. I implemented all methods of SparkListener and the only
 callback I get is onExecutorMetricsUpdate.

 Here's the complete code:

 ==

 import org.apache.spark.scheduler._

 sc.addSparkListener(new SparkListener() {
   override def onStageCompleted(e: SparkListenerStageCompleted) = println(" onStageCompleted");
   override def onStageSubmitted(e: SparkListenerStageSubmitted) = println(" onStageSubmitted");
   override def onTaskStart(e: SparkListenerTaskStart) = println(" onTaskStart");
   override def onTaskGettingResult(e: SparkListenerTaskGettingResult) = println(" onTaskGettingResult");
   override def onTaskEnd(e: SparkListenerTaskEnd) = println(" onTaskEnd");
   override def onJobStart(e: SparkListenerJobStart) = println(" onJobStart");
   override def onJobEnd(e: SparkListenerJobEnd) = println(" onJobEnd");
   override def onEnvironmentUpdate(e: SparkListenerEnvironmentUpdate) = println(" onEnvironmentUpdate");
   override def onBlockManagerAdded(e: SparkListenerBlockManagerAdded) = println(" onBlockManagerAdded");
   override def onBlockManagerRemoved(e: SparkListenerBlockManagerRemoved) = println(" onBlockManagerRemoved");
   override def onUnpersistRDD(e: SparkListenerUnpersistRDD) = println(" onUnpersistRDD");
   override def onApplicationStart(e: SparkListenerApplicationStart) = println(" onApplicationStart");
   override def onApplicationEnd(e: SparkListenerApplicationEnd) = println(" onApplicationEnd");
   override def onExecutorMetricsUpdate(e: SparkListenerExecutorMetricsUpdate) = println(" onExecutorMetricsUpdate");
 });

 sc.parallelize(List(1, 2, 3)).map(throw new SparkException("test")).collect();

 =

 On Fri, Apr 17, 2015 at 4:13 PM, Imran Rashid iras...@cloudera.com
 wrote:

 when you start the spark-shell, it's already too late to get the
 ApplicationStart event.  Try listening for StageCompleted or JobEnd 
 instead.

 On Fri, Apr 17, 2015 at 5:54 PM, Praveen Balaji 
 secondorderpolynom...@gmail.com wrote:

 I'm trying to create a simple SparkListener to get notified of errors
 on executors. I do not get any callbacks on my SparkListener. Here is some
 simple code I'm executing in spark-shell. But I still don't get any
 callbacks on my listener. Am I doing something wrong?

 Thanks for any clue you can send my way.

 Cheers
 Praveen

 ==
 import org.apache.spark.scheduler.SparkListener
 import org.apache.spark.scheduler.SparkListenerApplicationStart
 import org.apache.spark.scheduler.SparkListenerApplicationEnd
 import org.apache.spark.SparkException

 sc.addSparkListener(new SparkListener() {
   override def onApplicationStart(applicationStart: SparkListenerApplicationStart) {
     println(" onApplicationStart: " + applicationStart.appName);
   }

   override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd) {
     println(" onApplicationEnd: " + applicationEnd.time);
   }
 });

 sc.parallelize(List(1, 2, 3)).map(throw new SparkException("test")).collect();
 ===

 output:

 scala> org.apache.spark.SparkException: hshsh
 at $iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
 at $iwC$$iwC$$iwC.<init>(<console>:34)
 at $iwC$$iwC.<init>(<console>:36)
 at $iwC.<init>(<console>:38)








RE: compilation error

2015-04-19 Thread Brahma Reddy Battula
Hey Todd

Thanks a lot for your reply...Kindly check following details..

spark version :1.1.0
jdk:jdk1.7.0_60 ,
command:mvn -Pbigtop-dist  -Phive -Pyarn -Phadoop-2.4 
-Dhadoop.version=V100R001C00 -DskipTests package



Thanks & Regards



Brahma Reddy Battula


From: Ted Yu [yuzhih...@gmail.com]
Sent: Monday, April 20, 2015 8:07 AM
To: Brahma Reddy Battula
Cc: user@spark.apache.org
Subject: Re: compilation error

What JDK release are you using ?

Can you give the complete command you used ?

Which Spark branch are you working with ?

Cheers

On Sun, Apr 19, 2015 at 7:25 PM, Brahma Reddy Battula 
brahmareddy.batt...@huawei.com wrote:
Hi All

Getting following error, when I am compiling spark..What did I miss..? Even 
googled and did not find the exact solution for this...


[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on project 
spark-assembly_2.10: Error creating shaded jar: Error in ASM processing class 
com/ibm/icu/impl/data/LocaleElements_zh__PINYIN.class: 48188 - [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal 
org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on project 
spark-assembly_2.10: Error creating shaded jar: Error in ASM processing class 
com/ibm/icu/impl/data/LocaleElements_zh__PINYIN.class
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)




Thanks & Regards

Brahma Reddy Battula






Re: [STREAMING KAFKA - Direct Approach] JavaPairRDD cannot be cast to HasOffsetRanges

2015-04-19 Thread Sean Owen
You need to access the underlying RDD with .rdd() and cast that. That
works for me.
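
For comparison, a minimal Scala sketch of the same pattern (in Scala the cast goes
on the RDD itself; in the Java API, as noted above, it goes on rdd.rdd()). Here
directKafkaStream stands for the stream returned by KafkaUtils.createDirectStream:

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

directKafkaStream.foreachRDD { rdd =>
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // use the offsets, e.g. store them for exactly-once bookkeeping
}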

On Mon, Apr 20, 2015 at 4:41 AM, RimBerry
truonghoanglinhk55b...@gmail.com wrote:
 Hi everyone,

 I am trying to use the direct approach in streaming-kafka-integration
 (http://spark.apache.org/docs/latest/streaming-kafka-integration.html),
 pulling data from Kafka as follows:

 JavaPairInputDStream<String, String> messages =
     KafkaUtils.createDirectStream(jssc,
         String.class,
         String.class,
         StringDecoder.class,
         StringDecoder.class,
         kafkaParams,
         topicsSet);

 messages.foreachRDD(
     new Function<JavaPairRDD<String, String>, Void>() {
         @Override
         public Void call(JavaPairRDD<String, String> rdd) throws IOException {
             OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd).offsetRanges();
             // ...
             return null;
         }
     }
 );

 Then I got an error when running it:
 *java.lang.ClassCastException: org.apache.spark.api.java.JavaPairRDD cannot
 be cast to org.apache.spark.streaming.kafka.HasOffsetRanges* at
 OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd).offsetRanges();

 I am using version 1.3.1; is this a bug in this version?

 Thank you for spending time with me.




 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/STREAMING-KAFKA-Direct-Approach-JavaPairRDD-cannot-be-cast-to-HasOffsetRanges-tp22568.html



Re: compilation error

2015-04-19 Thread Sean Owen
Brahma since you can see the continuous integration builds are
passing, it's got to be something specific to your environment, right?
this is not even an error from Spark, but from Maven plugins.

On Mon, Apr 20, 2015 at 4:42 AM, Ted Yu yuzhih...@gmail.com wrote:
 bq. -Dhadoop.version=V100R001C00

 First time I saw above hadoop version. Doesn't look like Apache release.

 I checked my local maven repo but didn't find impl under
 ~/.m2/repository/com/ibm/icu

 FYI

 On Sun, Apr 19, 2015 at 8:04 PM, Brahma Reddy Battula
 brahmareddy.batt...@huawei.com wrote:

 Hey Todd

 Thanks a lot for your reply...Kindly check following details..

 spark version :1.1.0
 jdk:jdk1.7.0_60 ,
 command:mvn -Pbigtop-dist  -Phive -Pyarn -Phadoop-2.4
 -Dhadoop.version=V100R001C00 -DskipTests package


 Thanks & Regards



 Brahma Reddy Battula


 
 From: Ted Yu [yuzhih...@gmail.com]
 Sent: Monday, April 20, 2015 8:07 AM
 To: Brahma Reddy Battula
 Cc: user@spark.apache.org
 Subject: Re: compilation error

 What JDK release are you using ?

 Can you give the complete command you used ?

 Which Spark branch are you working with ?

 Cheers

 On Sun, Apr 19, 2015 at 7:25 PM, Brahma Reddy Battula
 brahmareddy.batt...@huawei.com wrote:

 Hi All

 Getting following error, when I am compiling spark..What did I miss..?
 Even googled and did not find the exact solution for this...


 [ERROR] Failed to execute goal
 org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on project
 spark-assembly_2.10: Error creating shaded jar: Error in ASM processing
 class com/ibm/icu/impl/data/LocaleElements_zh__PINYIN.class: 48188 - [Help
 1]
 org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
 goal org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on
 project spark-assembly_2.10: Error creating shaded jar: Error in ASM
 processing class com/ibm/icu/impl/data/LocaleElements_zh__PINYIN.class
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
 at
 org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
 at
 org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
 at
 org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
 at
 org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
 at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)



 Thanks & Regards

 Brahma Reddy Battula










Code Deployment tools in Production

2015-04-19 Thread Arun Patel
Generally what tools are used to schedule spark jobs in production?

How is Spark Streaming code deployed?

I am interested in knowing the tools used like cron, oozie etc.

Thanks,
Arun


[STREAMING KAFKA - Direct Approach] JavaPairRDD cannot be cast to HasOffsetRanges

2015-04-19 Thread RimBerry
Hi everyone,

I am trying to use the direct approach in streaming-kafka-integration
(http://spark.apache.org/docs/latest/streaming-kafka-integration.html),
pulling data from Kafka as follows:

JavaPairInputDStream<String, String> messages =
    KafkaUtils.createDirectStream(jssc,
        String.class,
        String.class,
        StringDecoder.class,
        StringDecoder.class,
        kafkaParams,
        topicsSet);

messages.foreachRDD(
    new Function<JavaPairRDD<String, String>, Void>() {
        @Override
        public Void call(JavaPairRDD<String, String> rdd) throws IOException {
            OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd).offsetRanges();
            // ...
            return null;
        }
    }
);

Then I got an error when running it:
*java.lang.ClassCastException: org.apache.spark.api.java.JavaPairRDD cannot
be cast to org.apache.spark.streaming.kafka.HasOffsetRanges* at
OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd).offsetRanges();

I am using version 1.3.1; is this a bug in this version?

Thank you for spending time with me.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/STREAMING-KAFKA-Direct-Approach-JavaPairRDD-cannot-be-cast-to-HasOffsetRanges-tp22568.html



Re: how to make a spark cluster ?

2015-04-19 Thread Jörn Franke
Hi, if you have just one physical machine then I would try out Docker
instead of full VMs (they would be a waste of memory and CPU).

Best regards
On 20 Apr 2015 00:11, hnahak harihar1...@gmail.com wrote:

 Hi All,

 I've got a big physical machine with 16 CPUs, 256 GB RAM, and a 20 TB hard disk. I just
 need to know what the best solution is to make a Spark cluster.

 If I need to process TBs of data then
 1. Only one machine, which contains the driver, executor, job tracker and task
 tracker (everything on one box).
 2. Create 4 VMs, each with 4 CPUs and 64 GB RAM.
 3. Create 8 VMs, each with 2 CPUs and 32 GB RAM.

 please give me your views/suggestions



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/how-to-make-a-spark-cluster-tp22563.html




RE: compilation error

2015-04-19 Thread Brahma Reddy Battula
Thanks a lot for your replies..

@Ted, V100R001C00 is our internal Hadoop version, which is based on Hadoop 
2.4.1.

@Sean Owen, yes, you are correct... I just wanted to know what leads to this 
problem...


Thanks & Regards
Brahma Reddy Battula

From: Sean Owen [so...@cloudera.com]
Sent: Monday, April 20, 2015 9:14 AM
To: Ted Yu
Cc: Brahma Reddy Battula; user@spark.apache.org
Subject: Re: compilation error

Brahma since you can see the continuous integration builds are
passing, it's got to be something specific to your environment, right?
this is not even an error from Spark, but from Maven plugins.

On Mon, Apr 20, 2015 at 4:42 AM, Ted Yu yuzhih...@gmail.com wrote:
 bq. -Dhadoop.version=V100R001C00

 First time I saw above hadoop version. Doesn't look like Apache release.

 I checked my local maven repo but didn't find impl under
 ~/.m2/repository/com/ibm/icu

 FYI

 On Sun, Apr 19, 2015 at 8:04 PM, Brahma Reddy Battula
 brahmareddy.batt...@huawei.com wrote:

 Hey Todd

 Thanks a lot for your reply...Kindly check following details..

 spark version :1.1.0
 jdk:jdk1.7.0_60 ,
 command:mvn -Pbigtop-dist  -Phive -Pyarn -Phadoop-2.4
 -Dhadoop.version=V100R001C00 -DskipTests package


 Thanks & Regards



 Brahma Reddy Battula


 
 From: Ted Yu [yuzhih...@gmail.com]
 Sent: Monday, April 20, 2015 8:07 AM
 To: Brahma Reddy Battula
 Cc: user@spark.apache.org
 Subject: Re: compilation error

 What JDK release are you using ?

 Can you give the complete command you used ?

 Which Spark branch are you working with ?

 Cheers

 On Sun, Apr 19, 2015 at 7:25 PM, Brahma Reddy Battula
 brahmareddy.batt...@huawei.com wrote:

 Hi All

 Getting following error, when I am compiling spark..What did I miss..?
 Even googled and did not find the exact solution for this...


 [ERROR] Failed to execute goal
 org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on project
 spark-assembly_2.10: Error creating shaded jar: Error in ASM processing
 class com/ibm/icu/impl/data/LocaleElements_zh__PINYIN.class: 48188 - [Help
 1]
 org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
 goal org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on
 project spark-assembly_2.10: Error creating shaded jar: Error in ASM
 processing class com/ibm/icu/impl/data/LocaleElements_zh__PINYIN.class
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
 at
 org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
 at
 org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
 at
 org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
 at
 org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
 at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)



 Thanks & Regards

 Brahma Reddy Battula










SparkStreaming onStart not being invoked on CustomReceiver attached to master with multiple workers

2015-04-19 Thread Ankit Patel




I am experiencing a problem with Spark Streaming (Spark 1.2.0): the onStart method 
is never called on my CustomReceiver when calling spark-submit against a master 
node with multiple workers. However, Spark Streaming works fine with no master 
node set. Has anyone noticed this issue?
   

Re: Code Deployment tools in Production

2015-04-19 Thread Vova Shelgunov
On 20 Apr 2015 05:45, Arun Patel arunp.bigd...@gmail.com wrote:

http://23.251.129.190:8090/spark-twitter-streaming-web/analysis/3fb28f76-62fe-47f3-a1a8-66ac610c2447.html
spark jobs in production?

 How is Spark Streaming code deployed?

 I am interested in knowing the tools used like cron, oozie etc.

 Thanks,
 Arun


Re: Dataframes Question

2015-04-19 Thread Ted Yu
That's right.

On Sun, Apr 19, 2015 at 8:59 AM, Arun Patel arunp.bigd...@gmail.com wrote:

 Thanks Ted.

 So, whatever the operations I am performing now are DataFrames and not
 SchemaRDD?  Is that right?

 Regards,
 Venkat

 On Sun, Apr 19, 2015 at 9:13 AM, Ted Yu yuzhih...@gmail.com wrote:

 bq. SchemaRDD is not existing in 1.3?

 That's right.

 See this thread for more background:

 http://search-hadoop.com/m/JW1q5zQ1Xw/spark+DataFrame+schemarddsubj=renaming+SchemaRDD+gt+DataFrame



 On Sat, Apr 18, 2015 at 5:43 PM, Abhishek R. Singh 
 abhis...@tetrationanalytics.com wrote:

 I am no expert myself, but from what I understand DataFrame is
 grandfathering SchemaRDD. This was done for API stability as spark sql
 matured out of alpha as part of 1.3.0 release.

 It is forward looking and brings (dataframe like) syntax that was not
 available with the older schema RDD.

 On Apr 18, 2015, at 4:43 PM, Arun Patel arunp.bigd...@gmail.com wrote:

  Experts,
 
  I have few basic questions on DataFrames vs Spark SQL.  My confusion
 is more with DataFrames.
 
  1)  What is the difference between Spark SQL and DataFrames?  Are they
 same?
  2)  Documentation says SchemaRDD is renamed as DataFrame. This means
 SchemaRDD is not existing in 1.3?
  3)  As per documentation, it looks like creating dataframe is no
 different than SchemaRDD -  df =
 sqlContext.jsonFile("examples/src/main/resources/people.json").
  So, my question is what is the difference?
 
  Thanks for your help.
 
  Arun








Re: Dataframes Question

2015-04-19 Thread Ted Yu
bq. SchemaRDD is not existing in 1.3?

That's right.

See this thread for more background:
http://search-hadoop.com/m/JW1q5zQ1Xw/spark+DataFrame+schemarddsubj=renaming+SchemaRDD+gt+DataFrame



On Sat, Apr 18, 2015 at 5:43 PM, Abhishek R. Singh 
abhis...@tetrationanalytics.com wrote:

 I am no expert myself, but from what I understand DataFrame is
 grandfathering SchemaRDD. This was done for API stability as spark sql
 matured out of alpha as part of 1.3.0 release.

 It is forward looking and brings (dataframe like) syntax that was not
 available with the older schema RDD.

 On Apr 18, 2015, at 4:43 PM, Arun Patel arunp.bigd...@gmail.com wrote:

  Experts,
 
  I have few basic questions on DataFrames vs Spark SQL.  My confusion is
 more with DataFrames.
 
  1)  What is the difference between Spark SQL and DataFrames?  Are they
 same?
  2)  Documentation says SchemaRDD is renamed as DataFrame. This means
 SchemaRDD is not existing in 1.3?
  3)  As per documentation, it looks like creating dataframe is no
 different than SchemaRDD -  df =
 sqlContext.jsonFile("examples/src/main/resources/people.json").
  So, my question is what is the difference?
 
  Thanks for your help.
 
  Arun






Re: Dataframes Question

2015-04-19 Thread Arun Patel
Thanks Ted.

So, whatever the operations I am performing now are DataFrames and not
SchemaRDD?  Is that right?

Regards,
Venkat

On Sun, Apr 19, 2015 at 9:13 AM, Ted Yu yuzhih...@gmail.com wrote:

 bq. SchemaRDD is not existing in 1.3?

 That's right.

 See this thread for more background:

 http://search-hadoop.com/m/JW1q5zQ1Xw/spark+DataFrame+schemarddsubj=renaming+SchemaRDD+gt+DataFrame



 On Sat, Apr 18, 2015 at 5:43 PM, Abhishek R. Singh 
 abhis...@tetrationanalytics.com wrote:

 I am no expert myself, but from what I understand DataFrame is
 grandfathering SchemaRDD. This was done for API stability as spark sql
 matured out of alpha as part of 1.3.0 release.

 It is forward looking and brings (dataframe like) syntax that was not
 available with the older schema RDD.

 On Apr 18, 2015, at 4:43 PM, Arun Patel arunp.bigd...@gmail.com wrote:

  Experts,
 
  I have few basic questions on DataFrames vs Spark SQL.  My confusion is
 more with DataFrames.
 
  1)  What is the difference between Spark SQL and DataFrames?  Are they
 same?
  2)  Documentation says SchemaRDD is renamed as DataFrame. This means
 SchemaRDD is not existing in 1.3?
  3)  As per documentation, it looks like creating dataframe is no
 different than SchemaRDD -  df =
 sqlContext.jsonFile("examples/src/main/resources/people.json").
  So, my question is what is the difference?
 
  Thanks for your help.
 
  Arun







Re: Date class not supported by SparkSQL

2015-04-19 Thread Lior Chaga
Here's a code example:

public class DateSparkSQLExample {

public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("test").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);

List<SomeObject> itemsList = Lists.newArrayListWithCapacity(1);
itemsList.add(new SomeObject(new Date(), 1L));
JavaRDD<SomeObject> someObjectJavaRDD = sc.parallelize(itemsList);

JavaSQLContext sqlContext = new
org.apache.spark.sql.api.java.JavaSQLContext(sc);
sqlContext.applySchema(someObjectJavaRDD,
SomeObject.class).registerTempTable("temp_table");
}

private static class SomeObject implements Serializable{
private Date timestamp;
private Long value;

public SomeObject() {
}

public SomeObject(Date timestamp, Long value) {
this.timestamp = timestamp;
this.value = value;
}

public Date getTimestamp() {
return timestamp;
}

public void setTimestamp(Date timestamp) {
this.timestamp = timestamp;
}

public Long getValue() {
return value;
}

public void setValue(Long value) {
this.value = value;
}
}
}
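
As a point of comparison (not from the thread, and assuming the MatchError comes
from the unsupported java.util.Date field type), a Scala sketch of the equivalent
using java.sql.Timestamp, which Spark SQL's schema inference does recognize:

// minimal sketch against the Spark 1.2-era Scala API
case class SomeObjectScala(timestamp: java.sql.Timestamp, value: Long)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD conversion in 1.2

val rdd = sc.parallelize(Seq(
  SomeObjectScala(new java.sql.Timestamp(System.currentTimeMillis()), 1L)))
rdd.registerTempTable("temp_table")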


On Sun, Apr 19, 2015 at 4:27 PM, Lior Chaga lio...@taboola.com wrote:

 Using Spark 1.2.0. Tried to apply a schema to an RDD and register it, and got:
 scala.MatchError: class java.util.Date (of class java.lang.Class)

 I see it was resolved in https://issues.apache.org/jira/browse/SPARK-2562
 (included in 1.2.0)

 Anyone encountered this issue?

 Thanks,
 Lior