Re: Spark streaming with Kafka - couldn't find KafkaUtils

2015-04-04 Thread Akhil Das
How are you submitting the application? Use a standard build tool like
Maven or sbt to build your project; it will download all the dependency
jars. When you submit your application with spark-submit, use the --jars
option to add the jars that are causing the ClassNotFoundException. If you
are running as a standalone application without using spark-submit, then
while creating the SparkContext use sc.addJar() to add those dependency
jars.

For Kafka streaming with sbt, these are the jars that are required:


sc.addJar("/root/.ivy2/cache/org.apache.spark/spark-streaming-kafka_2.10/jars/spark-streaming-kafka_2.10-1.1.0.jar")
   
sc.addJar("/root/.ivy2/cache/com.yammer.metrics/metrics-core/jars/metrics-core-2.2.0.jar")
   
sc.addJar("/root/.ivy2/cache/org.apache.kafka/kafka_2.10/jars/kafka_2.10-0.8.0.jar")
   sc.addJar("/root/.ivy2/cache/com.101tec/zkclient/jars/zkclient-0.3.jar")




Thanks
Best Regards

On Sun, Apr 5, 2015 at 12:00 PM, Priya Ch 
wrote:

> Hi All,
>
>   I configured a Kafka cluster on a single node and I have a streaming
> application which reads data from a Kafka topic using KafkaUtils. When I
> execute the code in local mode from the IDE, the application runs fine.
>
> But when I submit the same to the Spark cluster in standalone mode, I end up
> with the following exception:
> java.lang.ClassNotFoundException:
> org/apache/spark/streaming/kafka/KafkaUtils.
>
> I am using spark-1.2.1. When I checked the source files of
> streaming, the source files related to Kafka are missing. Are these not
> included in the spark-1.3.0 and spark-1.2.1 versions?
>
> Do I have to include these manually?
>
> Regards,
> Padma Ch
>


Re: Spark Streaming program questions

2015-04-04 Thread Sean Owen
The DAG can't change. You can create many DStreams, but they have to
belong to one StreamingContext. You can try these things to see.
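
As a minimal sketch (host names and ports here are made up), several DStreams
can hang off the same StreamingContext, as long as the whole DAG is wired up
before start() is called:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MultiStreamExample")
val ssc = new StreamingContext(conf, Seconds(1))

// Both streams belong to the single StreamingContext.
val linesA = ssc.socketTextStream("hostA", 9999)
val linesB = ssc.socketTextStream("hostB", 9998)

linesA.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()
linesB.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()

// No further transformations can be added after this point.
ssc.start()
ssc.awaitTermination()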

On Sun, Apr 5, 2015 at 2:13 AM, nickos168  wrote:
> I have two questions:
>
> 1) In a Spark Streaming program, after the various DStream transformations
> have been set up,
> the ssc.start() method is called to start the computation.
>
> Can the underlying DAG change (ie. add another map or maybe a join) after
> ssc.start() has been
> called (and maybe messages have already been received/processed for some
> batches)?
>
>
> 2) In a Spark Streaming program (one process), can I have multiple DStream
> transformations,
> each series belonging to its own StreamingContext (in the same thread or in
> different threads)?
>
> For example:
>  val lines_A = ssc_A.socketTextStream(..)
>  val words_A = lines_A.flatMap(_.split(" "))
>  val wordCounts_A = words_A.map(x => (x, 1)).reduceByKey(_ + _)
> wordCounts_A.print()
>
> val lines_B = ssc_B.socketTextStream(..)
> val words_B = lines_B.flatMap(_.split(" "))
> val wordCounts_B = words_B.map(x => (x, 1)).reduceByKey(_ + _)
>
> wordCounts_B.print()
>
> ssc_A.start()
>   ssc_B.start()
>
> I think the answer is NO to both questions but I am wondering what is the
> reason for this behavior.
>
>
> Thanks,
>
> Nickos
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark streaming with Kafka - couldn't find KafkaUtils

2015-04-04 Thread Priya Ch
Hi All,

  I configured a Kafka cluster on a single node and I have a streaming
application which reads data from a Kafka topic using KafkaUtils. When I
execute the code in local mode from the IDE, the application runs fine.

But when I submit the same to the Spark cluster in standalone mode, I end up
with the following exception:
java.lang.ClassNotFoundException:
org/apache/spark/streaming/kafka/KafkaUtils.

I am using spark-1.2.1. When I checked the source files of
streaming, the source files related to Kafka are missing. Are these not
included in the spark-1.3.0 and spark-1.2.1 versions?

Do I have to include these manually?

Regards,
Padma Ch


Re: Spark SQL Self join with agreegate

2015-04-04 Thread SachinJanani
I am not sure whether this is possible, but I have tried something like:
SELECT time, src, dst, sum(val1), sum(val2) from table group by
src,dst;
and it works. I think it will give the same result as you are expecting.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Self-join-with-agreegate-tp22151p22378.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Streaming program questions

2015-04-04 Thread Aj K
UNSUBSCRIBE

On Sun, Apr 5, 2015 at 6:43 AM, nickos168 
wrote:

> I have two questions:
>
> 1) In a Spark Streaming program, after the various DStream transformations
> have been set up,
> the ssc.start() method is called to start the computation.
>
> Can the underlying DAG change (ie. add another map or maybe a join) after
> ssc.start() has been
> called (and maybe messages have already been received/processed for some
> batches)?
>
>
> 2) In a Spark Streaming program (one process), can I have multiple DStream
> transformations,
> each series belonging to its own StreamingContext (in the same thread or
> in different threads)?
>
> For example:
>  val lines_A = ssc_A.socketTextStream(..)
>  val words_A = lines_A.flatMap(_.split(" "))
>  val wordCounts_A = words_A.map(x => (x, 1)).reduceByKey(_ + _)
> wordCounts_A.print()
>
> val lines_B = ssc_B.socketTextStream(..)
> val words_B = lines_B.flatMap(_.split(" "))
> val wordCounts_B = words_B.map(x => (x, 1)).reduceByKey(_ + _)
>
> wordCounts_B.print()
>
> ssc_A.start()
>   ssc_B.start()
>
> I think the answer is NO to both questions but I am wondering what is the
> reason for this behavior.
>
>
> Thanks,
>
> Nickos
>
>


Spark Streaming program questions

2015-04-04 Thread nickos168
I have two questions:

1) In a Spark Streaming program, after the various DStream transformations have
been set up,
the ssc.start() method is called to start the computation.

Can the underlying DAG change (ie. add another map or maybe a join) after 
ssc.start() has been 
called (and maybe messages have already been received/processed for some 
batches)?


2) In a Spark Streaming program (one process), can I have multiple DStream 
transformations, 
each series belonging to its own StreamingContext (in the same thread or in
different threads)?

For example:
 val lines_A = ssc_A.socketTextStream(..)
 val words_A = lines_A.flatMap(_.split(" "))
 val wordCounts_A = words_A.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts_A.print()

val lines_B = ssc_B.socketTextStream(..)
val words_B = lines_B.flatMap(_.split(" "))
val wordCounts_B = words_B.map(x => (x, 1)).reduceByKey(_ + _)

wordCounts_B.print()

ssc_A.start()
  ssc_B.start()

I think the answer is NO to both questions but I am wondering what is the 
reason for this behavior.


Thanks,

Nickos



DataFrame groupBy MapType

2015-04-04 Thread Justin Yip
Hello,

I have a case class like this:

case class A(
  m: Map[Long, Long],
  ...
)

and constructed a DataFrame from Seq[A].

I would like to perform a groupBy on A.m("SomeKey"). I can implement a UDF,
create a new Column, and then invoke a groupBy on the new Column. But is that
the idiomatic way of doing such an operation?
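
For what it's worth, a minimal, untested sketch of that UDF route (assuming
Spark 1.3's org.apache.spark.sql.functions.udf, a DataFrame df built from the
Seq[A] above, and a made-up key value):

import org.apache.spark.sql.functions.udf

val someKey = 42L  // hypothetical key
val valueForKey = udf((m: scala.collection.Map[Long, Long]) => m.getOrElse(someKey, 0L))

// Group on the extracted map value rather than on the raw MapType column.
val grouped = df.groupBy(valueForKey(df("m")).as("someKeyValue")).count()
grouped.show()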

I can't find much info about operating on MapType columns in the docs.

Thanks ahead!

Justin


CPU Usage for Spark Local Mode

2015-04-04 Thread Wenlei Xie
Hi,

I am currently testing my application with Spark under local mode, and I
set the master to be local[4]. One thing I noticed is that when there is a
groupBy/reduceBy operation involved, the CPU usage can sometimes be around
600% to 800%. I am wondering if this is expected? (As only 4 worker threads
are assigned, together with the driver thread, shouldn't it be at most 500%?)

Best,
Wenlei


Re: UNRESOLVED DEPENDENCIES while building Spark 1.3.0

2015-04-04 Thread Dean Wampler
Use the MVN build instead. From the README in the git repo (
https://github.com/apache/spark)

mvn -DskipTests clean package



Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
 (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Sat, Apr 4, 2015 at 4:39 PM, mas  wrote:

> Hi All,
> I am trying to build Spark 1.3.0 on standalone Ubuntu 14.04. I am using the
> sbt command, i.e. "sbt/sbt assembly", to build it. This command works pretty
> well with Spark version 1.1; however, it gives the following error with Spark
> 1.3.0. Any help or suggestions to resolve this would be highly appreciated.
>
> [info] Done updating.
> [info] Updating {file:/home/roott/aamirTest/spark/}network-shuffle...
> [info] Resolving org.fusesource.jansi#jansi;1.4 ...
> [warn]  ::
> [warn]  ::  UNRESOLVED DEPENDENCIES ::
> [warn]  ::
> [warn]  :: org.apache.spark#spark-network-common_2.10;1.3.0: configuration
> not public in org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It was
> required from org.apache.spark#spark-network-shuffle_2.10;1.3.0 test
> [warn]  ::
> [warn]
> [warn]  Note: Unresolved dependencies path:
> [warn]  org.apache.spark:spark-network-common_2.10:1.3.0 ((com.typesafe.sbt.pom.MavenHelper) MavenHelper.scala#L76)
> [warn]+- org.apache.spark:spark-network-shuffle_2.10:1.3.0
> sbt.ResolveException: unresolved dependency:
> org.apache.spark#spark-network-common_2.10;1.3.0: configuration not public in
> org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It was required from
> org.apache.spark#spark-network-shuffle_2.10;1.3.0 test
> at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:278)
> at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:175)
> at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:157)
> at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)
> at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)
> at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:128)
> at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:56)
> at sbt.IvySbt$$anon$4.call(Ivy.scala:64)
> at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
> at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
> at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
> at xsbt.boot.Using$.withResource(Using.scala:10)
> at xsbt.boot.Using$.apply(Using.scala:9)
> at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
> at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
> at xsbt.boot.Locks$.apply0(Locks.scala:31)
> at xsbt.boot.Locks$.apply(Locks.scala:28)
> at sbt.IvySbt.withDefaultLogger(Ivy.scala:64)
> at sbt.IvySbt.withIvy(Ivy.scala:123)
> at sbt.IvySbt.withIvy(Ivy.scala:120)
> at sbt.IvySbt$Module.withModule(Ivy.scala:151)
> at sbt.IvyActions$.updateEither(IvyActions.scala:157)
> at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1318)
> at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1315)
> at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$85.apply(Defaults.scala:1345)
> at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$85.apply(Defaults.scala:1343)
> at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
> at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1348)
> at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1342)
> at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
> at sbt.Classpaths$.cachedUpdate(Defaults.scala:1360)
> at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1300)
> at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1275)
> at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
> at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
> at sbt.std.Transform$$anon$4.work(System.scala:63)
> at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
> at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
> at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
> at sbt.Execute.work(Execute.scala:235)
> at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
> at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
> at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
> at sbt.CompletionS

UNRESOLVED DEPENDENCIES while building Spark 1.3.0

2015-04-04 Thread mas
Hi All,
I am trying to build Spark 1.3.0 on standalone Ubuntu 14.04. I am using the
sbt command, i.e. "sbt/sbt assembly", to build it. This command works pretty
well with Spark version 1.1; however, it gives the following error with Spark
1.3.0. Any help or suggestions to resolve this would be highly appreciated.

[info] Done updating.
[info] Updating {file:/home/roott/aamirTest/spark/}network-shuffle...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[warn]  ::
[warn]  ::  UNRESOLVED DEPENDENCIES ::
[warn]  ::
[warn]  :: org.apache.spark#spark-network-common_2.10;1.3.0: configuration
not public in org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It was
required from org.apache.spark#spark-network-shuffle_2.10;1.3.0 test
[warn]  ::
[warn]
[warn]  Note: Unresolved dependencies path:
[warn]  org.apache.spark:spark-network-common_2.10:1.3.0 ((com.typesafe.sbt.pom.MavenHelper) MavenHelper.scala#L76)
[warn]+- org.apache.spark:spark-network-shuffle_2.10:1.3.0
sbt.ResolveException: unresolved dependency:
org.apache.spark#spark-network-common_2.10;1.3.0: configuration not public in
org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It was required from
org.apache.spark#spark-network-shuffle_2.10;1.3.0 test
        at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:278)
        at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:175)
        at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:157)
        at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)
        at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)
        at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:128)
        at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:56)
        at sbt.IvySbt$$anon$4.call(Ivy.scala:64)
        at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
        at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
        at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
        at xsbt.boot.Using$.withResource(Using.scala:10)
        at xsbt.boot.Using$.apply(Using.scala:9)
        at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
        at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
        at xsbt.boot.Locks$.apply0(Locks.scala:31)
        at xsbt.boot.Locks$.apply(Locks.scala:28)
        at sbt.IvySbt.withDefaultLogger(Ivy.scala:64)
        at sbt.IvySbt.withIvy(Ivy.scala:123)
        at sbt.IvySbt.withIvy(Ivy.scala:120)
        at sbt.IvySbt$Module.withModule(Ivy.scala:151)
        at sbt.IvyActions$.updateEither(IvyActions.scala:157)
        at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1318)
        at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1315)
        at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$85.apply(Defaults.scala:1345)
        at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$85.apply(Defaults.scala:1343)
        at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
        at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1348)
        at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1342)
        at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
        at sbt.Classpaths$.cachedUpdate(Defaults.scala:1360)
        at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1300)
        at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1275)
        at scala.Function1$$

UNRESOLVED DEPENDENCIES while building Spark 1.3.0

2015-04-04 Thread mas
Hi All,

I am trying to build Spark 1.3.0 on a standalone Ubuntu 14.04 machine. I am
using the "sbt/sbt assembly" command to build it. However, this command works
pretty well with Spark version "1.1.0", but for Spark 1.3 it gives the following
error.
Any help or suggestions to resolve this problem will be highly appreciated.

] Resolving org.fusesource.jansi#jansi;1.4 ...
[warn]  ::
[warn]  ::  UNRESOLVED DEPENDENCIES ::
[warn]  ::
[warn]  :: org.apache.spark#spark-network-common_2.10;1.3.0: configuration
not public in org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It was
required from org.apache.spark#spark-network-shuffle_2.10;1.3.0 test
[warn]  ::
[warn]
[warn]  Note: Unresolved dependencies path:
[warn]  org.apache.spark:spark-network-common_2.10:1.3.0 ((com.typesafe.sbt.pom.MavenHelper) MavenHelper.scala#L76)
[warn]+- org.apache.spark:spark-network-shuffle_2.10:1.3.0
sbt.ResolveException: unresolved dependency:
org.apache.spark#spark-network-common_2.10;1.3.0: configuration not public in
org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It was required from
org.apache.spark#spark-network-shuffle_2.10;1.3.0 test
        at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:278)
        at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:175)
        at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:157)
        at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)
        at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)
        at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:128)
        at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:56)
        at sbt.IvySbt$$anon$4.call(Ivy.scala:64)
        at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
        at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
        at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
        at xsbt.boot.Using$.withResource(Using.scala:10)
        at xsbt.boot.Using$.apply(Using.scala:9)
        at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
        at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
        at xsbt.boot.Locks$.apply0(Locks.scala:31)
        at xsbt.boot.Locks$.apply(Locks.scala:28)
        at sbt.IvySbt.withDefaultLogger(Ivy.scala:64)
        at sbt.IvySbt.withIvy(Ivy.scala:123)
        at sbt.IvySbt.withIvy(Ivy.scala:120)
        at sbt.IvySbt$Module.withModule(Ivy.scala:151)
        at sbt.IvyActions$.updateEither(IvyActions.scala:157)
        at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1318)
        at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1315)
        at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$85.apply(Defaults.scala:1345)
        at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$85.apply(Defaults.scala:1343)
        at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
        at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1348)
        at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1342)
        at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
        at sbt.Classpaths$.cachedUpdate(Defaults.scala:1360)
        at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1300)
        at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1275)
        at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
        at sbt.$tilde$greater$$anonfun$$u221

Re: Spark + Kinesis

2015-04-04 Thread Vadim Bichutskiy
Hi all,

More good news! I was able to use mergeStrategy to assemble my Kinesis
consumer into an "uber jar".

Here's what I added to build.sbt:

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case PathList("com", "esotericsoftware", "minlog", xs @ _*) => MergeStrategy.first
    case PathList("com", "google", "common", "base", xs @ _*) => MergeStrategy.first
    case PathList("org", "apache", "commons", xs @ _*) => MergeStrategy.last
    case PathList("org", "apache", "hadoop", xs @ _*) => MergeStrategy.first
    case PathList("org", "apache", "spark", "unused", xs @ _*) => MergeStrategy.first
    case x => old(x)
  }
}

Everything appears to be working fine. Right now my producer is pushing
simple strings through Kinesis,
which my consumer is trying to print (using Spark's print() method for now).

However, instead of displaying my strings, I get the following:

15/04/04 18:57:32 INFO scheduler.ReceivedBlockTracker: Deleting batches ArrayBuffer(1428173848000 ms)

Any idea on what might be going on?

Thanks,

Vadim

Here's my consumer code (adapted from the WordCount example):

private object MyConsumer extends Logging {
  def main(args: Array[String]) {
    /* Check that all required args were passed in. */
    if (args.length < 2) {
      System.err.println(
        """
          |Usage: KinesisWordCount <stream-name> <endpoint-url>
          |  <stream-name> is the name of the Kinesis stream
          |  <endpoint-url> is the endpoint of the Kinesis service
          |    (e.g. https://kinesis.us-east-1.amazonaws.com)
        """.stripMargin)
      System.exit(1)
    }

    /* Populate the appropriate variables from the given args */
    val Array(streamName, endpointUrl) = args

    /* Determine the number of shards from the stream */
    val kinesisClient = new AmazonKinesisClient(new DefaultAWSCredentialsProviderChain())
    kinesisClient.setEndpoint(endpointUrl)
    val numShards =
      kinesisClient.describeStream(streamName).getStreamDescription().getShards().size()
    System.out.println("Num shards: " + numShards)

    /* In this example, we're going to create 1 Kinesis Worker/Receiver/DStream for each shard. */
    val numStreams = numShards

    /* Set up the SparkConfig and StreamingContext */
    /* Spark Streaming batch interval */
    val batchInterval = Milliseconds(2000)
    val sparkConfig = new SparkConf().setAppName("MyConsumer")
    val ssc = new StreamingContext(sparkConfig, batchInterval)

    /* Kinesis checkpoint interval.  Same as batchInterval for this example. */
    val kinesisCheckpointInterval = batchInterval

    /* Create the same number of Kinesis DStreams/Receivers as the Kinesis stream's shards */
    val kinesisStreams = (0 until numStreams).map { i =>
      KinesisUtils.createStream(ssc, streamName, endpointUrl, kinesisCheckpointInterval,
        InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)
    }

    /* Union all the streams */
    val unionStreams = ssc.union(kinesisStreams).map(byteArray => new String(byteArray))
    unionStreams.print()

    ssc.start()
    ssc.awaitTermination()
  }
}


On Fri, Apr 3, 2015 at 3:48 PM, Tathagata Das  wrote:

> Just remove "provided" for spark-streaming-kinesis-asl
>
> libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl"
> % "1.3.0"
>
> On Fri, Apr 3, 2015 at 12:45 PM, Vadim Bichutskiy <
> vadim.bichuts...@gmail.com> wrote:
>
>> Thanks. So how do I fix it?
>>
>> On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan 
>> wrote:
>>
>>>   spark-streaming-kinesis-asl is not part of the Spark distribution on
>>> your cluster, so you cannot have it be just a "provided" dependency.  This
>>> is also why the KCL and its dependencies were not included in the assembly
>>> (but yes, they should be).
>>>
>>>
>>>  ~ Jonathan Kelly
>>>
>>>   From: Vadim Bichutskiy 
>>> Date: Friday, April 3, 2015 at 12:26 PM
>>> To: Jonathan Kelly 
>>> Cc: "user@spark.apache.org" 
>>> Subject: Re: Spark + Kinesis
>>>
>>>   Hi all,
>>>
>>>  Good news! I was able to create a Kinesis consumer and assemble it
>>> into an "uber jar" following
>>> http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
>>> and example
>>> https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
>>> .
>>>
>>>  However when I try to spark-submit it I get the following exception:
>>>
>>>  *Exception in thread "main" java.lang.NoClassDefFoundError:
>>> com/amazonaws/auth/AWSCredentialsProvider*
>>>
>>>  Do I need to include KCL dependency in *build.sbt*, here's what it
>>> looks like currently:
>>>
>>>  import AssemblyKeys._
>>> name := "Kinesis Consumer"
>>> version := "1.0"
>>> organization := "com.myconsumer"
>>> scalaVersion := "2.11.5"
>>>
>>>  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" %
>>> "provided"
>>> libraryDependencies += "org.apach

Processing Time Spikes (Spark Streaming)

2015-04-04 Thread t1ny
Hi all,

I am running some benchmarks on a simple Spark application which consists of:
- textFileStream() to extract text records from HDFS files
- map() to parse records into JSON objects
- updateStateByKey() to calculate and store an in-memory state for each key.

The processing time per batch gets slower as time passes and the number of
states increases, which is expected.
However, we also notice spikes occurring at rather regular intervals. What
could cause those spikes? We first suspected the GC, but the logs/metrics
don't seem to show any significant GC-related delays. Could this be related
to checkpointing? Disk access latencies?

I've attached a graph so you can visualize the problem (please ignore the
first spike, which corresponds to system initialization).

Thanks !



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Processing-Time-Spikes-Spark-Streaming-tp22375.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: newAPIHadoopRDD Mutiple scan result return from Hbase

2015-04-04 Thread Jeetendra Gangele
Here is my conf object, passed as the first parameter of the API.
I want to pass multiple scans, meaning I have 4 criteria for START ROW
and STOP ROW on the same table.
Using the code below I can only get the result for one STARTROW and ENDROW.

Configuration conf = DBConfiguration.getConf();

// int scannerTimeout = (int) conf.getLong(
//     HConstants.HBASE_REGIONSERVER_LEASE_PERIOD_KEY, -1);
// System.out.println("lease timeout on server is" + scannerTimeout);

int scannerTimeout = (int) conf.getLong("hbase.client.scanner.timeout.period", -1);
// conf.setLong("hbase.client.scanner.timeout.period", 6L);
conf.set(TableInputFormat.INPUT_TABLE, TABLE_NAME);

Scan scan = new Scan();
scan.addFamily(FAMILY);

FilterList filterList = new FilterList(Operator.MUST_PASS_ALL);
filterList.addFilter(new KeyOnlyFilter());
filterList.addFilter(new FirstKeyOnlyFilter());
scan.setFilter(filterList);

scan.setCacheBlocks(false);
scan.setCaching(10);
scan.setBatch(1000);
scan.setSmall(false);

conf.set(TableInputFormat.SCAN, DatabaseUtils.convertScanToString(scan));
return conf;
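
One approach that might work here (an untested sketch, not something I have
run) is HBase's MultiTableInputFormat, which accepts several serialized scans
at once; each scan carries its own start/stop row plus the table name as a
scan attribute, so newAPIHadoopRDD is called only once. Roughly, in Scala,
reusing the FAMILY, TABLE_NAME and DatabaseUtils.convertScanToString helpers
from the code above and made-up row ranges:

import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.MultiTableInputFormat
import org.apache.hadoop.hbase.util.Bytes

// One (start, stop) pair per criterion -- replace with the 4 real ranges.
val ranges = Seq(("row-a", "row-b"), ("row-c", "row-d"))

val serializedScans = ranges.map { case (start, stop) =>
  val scan = new Scan(Bytes.toBytes(start), Bytes.toBytes(stop))
  scan.addFamily(FAMILY)
  // MultiTableInputFormat needs the table name set on every scan.
  scan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, Bytes.toBytes(TABLE_NAME))
  DatabaseUtils.convertScanToString(scan)
}

conf.setStrings(MultiTableInputFormat.SCANS, serializedScans: _*)

val rdd = sc.newAPIHadoopRDD(conf, classOf[MultiTableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])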

On 4 April 2015 at 20:54, Jeetendra Gangele  wrote:

> Hi All,
>
> Can we get the results of multiple scans
> from JavaSparkContext.newAPIHadoopRDD against HBase?
>
> This method's first parameter takes a configuration object where I have added
> a filter. But how can I query multiple scans from the same table while calling
> this API only once?
>
> regards
> jeetendra
>


newAPIHadoopRDD Mutiple scan result return from Hbase

2015-04-04 Thread Jeetendra Gangele
Hi All,

Can we get the results of multiple scans
from JavaSparkContext.newAPIHadoopRDD against HBase?

This method's first parameter takes a configuration object where I have added
a filter. But how can I query multiple scans from the same table while calling
this API only once?

regards
jeetendra


Re: Spark Streaming FileStream Nested File Support

2015-04-04 Thread Akhil Das
We have a custom version/build of Spark Streaming that does the nested S3
lookups faster (it uses the native S3 APIs). You can find the source code over
here: https://github.com/sigmoidanalytics/spark-modified, in particular the
changes from here.
And the binary jars here:
https://github.com/sigmoidanalytics/spark-modified/tree/master/lib

Here's the instructions to use it:

This is how you create your stream:

val lines = ssc.s3FileStream[LongWritable, Text, TextInputFormat]("bucketname/")


You need ACCESS_KEY and SECRET_KEY in the environment for this to work.
Also, by default it is recursive.

Also you need these jars in the SPARK_CLASSPATH:


aws-java-sdk-1.8.3.jar        httpclient-4.2.5.jar
aws-java-sdk-1.9.24.jar       httpcore-4.3.2.jar
aws-java-sdk-core-1.9.24.jar  joda-time-2.6.jar
aws-java-sdk-s3-1.9.24.jar    spark-streaming_2.10-1.2.0.jar



Let me know if you need any more clarification/information on this, feel
free to suggest changes.




Thanks
Best Regards

On Sat, Apr 4, 2015 at 3:30 AM, Tathagata Das  wrote:

> Yes, definitely can be added. Just haven't gotten around to doing it :)
> There are proposals for this that you can try -
> https://github.com/apache/spark/pull/2765/files . Have you reviewed it at
> some point?
>
> On Fri, Apr 3, 2015 at 1:08 PM, Adam Ritter  wrote:
>
>> That doesn't seem like a good solution unfortunately as I would be
>> needing this to work in a production environment.  Do you know why the
>> limitation exists for FileInputDStream in the first place?  Unless I'm
>> missing something important about how some of the internals work I don't
>> see why this feature could be added in at some point.
>>
>> On Fri, Apr 3, 2015 at 12:47 PM, Tathagata Das 
>> wrote:
>>
>>> I sort-a-hacky workaround is to use a queueStream where you can manually
>>> create RDDs (using sparkContext.hadoopFile) and insert into the queue. Note
>>> that this is for testing only as queueStream does not work with driver
>>> fautl recovery.
>>>
>>> TD
>>>
>>> On Fri, Apr 3, 2015 at 12:23 PM, adamgerst  wrote:
>>>
 So after pulling my hair out for a bit trying to convert one of my
 standard
 spark jobs to streaming I found that FileInputDStream does not support
 nested folders (see the brief mention here

 http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
 the fileStream method returns a FileInputDStream).  So before, for my
 standard job, I was reading from say

 s3n://mybucket/2015/03/02/*log

 And could also modify it to simply get an entire months worth of logs.
 Since the logs are split up based upon their date, when the batch ran
 for
 the day, I simply passed in a parameter of the date to make sure I was
 reading the correct data

 But since I want to turn this job into a streaming job I need to simply
 do
 something like

 s3n://mybucket/*log

 This would totally work fine if it were a standard spark application,
 but
 fails for streaming.  Is there any way I can get around this limitation?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-FileStream-Nested-File-Support-tp22370.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>


Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-04 Thread Cheng Lian

Filed https://issues.apache.org/jira/browse/SPARK-6708 to track this.

Cheng

On 4/4/15 10:21 PM, Cheng Lian wrote:

I think this is a bug in Spark SQL that dates back to at least 1.1.0.

The json_tuple function is implemented as 
org.apache.hadoop.hive.ql.udf.generic.GenericUDTFJSONTuple. The 
ClassNotFoundException should complain with the class name rather than 
the UDTF function name.


The problematic line should be this one.
HiveFunctionWrapper expects the fully qualified class name of the UDTF
class that implements the function, but we pass in the function name.


Thanks for reporting this!

Cheng

On 4/2/15 3:19 AM, Todd Nist wrote:


I have a feeling I’m missing a jar that provides the support, or this
may be related to
https://issues.apache.org/jira/browse/SPARK-5792. If it is a jar,
where would I find it? I would have thought in the $HIVE/lib
folder, but I'm not sure which jar contains it.


Error:

Create Metric Temporary Table for querying
15/04/01 14:41:44 INFO HiveMetaStore: 0: Opening raw store with implemenation class: org.apache.hadoop.hive.metastore.ObjectStore
15/04/01 14:41:44 INFO ObjectStore: ObjectStore, initialize called
15/04/01 14:41:45 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/04/01 14:41:45 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/04/01 14:41:45 INFO BlockManager: Removing broadcast 0
15/04/01 14:41:45 INFO BlockManager: Removing block broadcast_0
15/04/01 14:41:45 INFO MemoryStore: Block broadcast_0 of size 1272 dropped from memory (free 278018571)
15/04/01 14:41:45 INFO BlockManager: Removing block broadcast_0_piece0
15/04/01 14:41:45 INFO MemoryStore: Block broadcast_0_piece0 of size 869 dropped from memory (free 278019440)
15/04/01 14:41:45 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.1.5:63230 in memory (size: 869.0 B, free: 265.1 MB)
15/04/01 14:41:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/04/01 14:41:45 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.1.5:63278 in memory (size: 869.0 B, free: 530.0 MB)
15/04/01 14:41:45 INFO ContextCleaner: Cleaned broadcast 0
15/04/01 14:41:46 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/04/01 14:41:46 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/04/01 14:41:46 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/04/01 14:41:47 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/04/01 14:41:47 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/04/01 14:41:47 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/04/01 14:41:47 INFO ObjectStore: Initialized ObjectStore
15/04/01 14:41:47 INFO HiveMetaStore: Added admin role in metastore
15/04/01 14:41:47 INFO HiveMetaStore: Added public role in metastore
15/04/01 14:41:48 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/04/01 14:41:48 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/04/01 14:41:49 INFO ParseDriver: Parsing command: SELECT path, name, value, v1.peValue, v1.peName
  FROM metric
  lateral view json_tuple(pathElements, 'name', 'value') v1 as peName, peValue
15/04/01 14:41:49 INFO ParseDriver: Parse Completed
Exception in thread "main" java.lang.ClassNotFoundException: json_tuple
    at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.sql.hive.HiveFunctionWrapper.createFunction(Shim13.scala:141)
    at org.apache.spark.sql.hive.HiveGenericUdtf.function$lzycompute(hiveUdfs.scala:261)
    at org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:261)
    at org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector$lzycompute(hiveUdfs.scala:267)
    at org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector(hiveUdfs.scala:

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-04 Thread Cheng Lian

I think this is a bug in Spark SQL that dates back to at least 1.1.0.

The json_tuple function is implemented as 
org.apache.hadoop.hive.ql.udf.generic.GenericUDTFJSONTuple. The 
ClassNotFoundException should complain with the class name rather than 
the UDTF function name.


The problematic line should be this one.
HiveFunctionWrapper expects the fully qualified class name of the UDTF
class that implements the function, but we pass in the function name.


Thanks for reporting this!

Cheng

On 4/2/15 3:19 AM, Todd Nist wrote:


I have a feeling I’m missing a jar that provides the support, or this
may be related to
https://issues.apache.org/jira/browse/SPARK-5792. If it is a jar, where
would I find it? I would have thought in the $HIVE/lib folder, but
I'm not sure which jar contains it.


Error:

Create Metric Temporary Table for querying
15/04/01 14:41:44 INFO HiveMetaStore: 0: Opening raw store with implemenation class: org.apache.hadoop.hive.metastore.ObjectStore
15/04/01 14:41:44 INFO ObjectStore: ObjectStore, initialize called
15/04/01 14:41:45 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/04/01 14:41:45 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/04/01 14:41:45 INFO BlockManager: Removing broadcast 0
15/04/01 14:41:45 INFO BlockManager: Removing block broadcast_0
15/04/01 14:41:45 INFO MemoryStore: Block broadcast_0 of size 1272 dropped from memory (free 278018571)
15/04/01 14:41:45 INFO BlockManager: Removing block broadcast_0_piece0
15/04/01 14:41:45 INFO MemoryStore: Block broadcast_0_piece0 of size 869 dropped from memory (free 278019440)
15/04/01 14:41:45 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.1.5:63230 in memory (size: 869.0 B, free: 265.1 MB)
15/04/01 14:41:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/04/01 14:41:45 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.1.5:63278 in memory (size: 869.0 B, free: 530.0 MB)
15/04/01 14:41:45 INFO ContextCleaner: Cleaned broadcast 0
15/04/01 14:41:46 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/04/01 14:41:46 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/04/01 14:41:46 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/04/01 14:41:47 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/04/01 14:41:47 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/04/01 14:41:47 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/04/01 14:41:47 INFO ObjectStore: Initialized ObjectStore
15/04/01 14:41:47 INFO HiveMetaStore: Added admin role in metastore
15/04/01 14:41:47 INFO HiveMetaStore: Added public role in metastore
15/04/01 14:41:48 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/04/01 14:41:48 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/04/01 14:41:49 INFO ParseDriver: Parsing command: SELECT path, name, value, v1.peValue, v1.peName
  FROM metric
  lateral view json_tuple(pathElements, 'name', 'value') v1 as peName, peValue
15/04/01 14:41:49 INFO ParseDriver: Parse Completed
Exception in thread "main" java.lang.ClassNotFoundException: json_tuple
    at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.sql.hive.HiveFunctionWrapper.createFunction(Shim13.scala:141)
    at org.apache.spark.sql.hive.HiveGenericUdtf.function$lzycompute(hiveUdfs.scala:261)
    at org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:261)
    at org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector$lzycompute(hiveUdfs.scala:267)
    at org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector(hiveUdfs.scala:267)
    at org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes$lzycompute(hiveUdfs.scala:272)
    at org

Re: Parquet timestamp support for Hive?

2015-04-04 Thread Cheng Lian
Avoiding maintaining a separate Hive version was one of the initial
purposes of Spark SQL. (We had once done this for Shark.) The
org.spark-project.hive:hive-0.13.1a artifact only cleans up some 3rd-party
dependencies to avoid dependency hell in Spark. This artifact is exactly
the same as Hive 0.13.1 at the source level.


On the other hand, we're planning to add a Hive metastore adapter layer 
to Spark SQL so that in the future we can talk to arbitrary versions 
greater than or equal to 0.13.1 of Hive metastore, and then always stick 
to the most recent Hive versions to provide the most recent Hive 
features. This will probably happen in Spark 1.4 or 1.5.


Cheng

On 4/3/15 7:59 PM, Rex Xiong wrote:

Hi,

I got this error when creating a hive table from parquet file:
DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.UnsupportedOperationException: Parquet does not support 
timestamp. See HIVE-6384


I checked HIVE-6384; it's fixed in 0.14.
The Hive in the Spark build is a customized version 0.13.1a
(GroupId: org.spark-project.hive). Is it possible to get the source
code for it and apply the patch from HIVE-6384?


Thanks



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Issue of sqlContext.createExternalTable with parquet partition discovery after changing folder structure

2015-04-04 Thread Cheng Lian
You need to refresh the external table manually after updating the data 
source outside Spark SQL:


- via Scala API: sqlContext.refreshTable("table1")
- via SQL: REFRESH TABLE table1;
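
A quick spark-shell sketch of that flow, using the table from the example
quoted below (the counts are the ones reported in the original message):

val t = sqlContext.createExternalTable("table1", "hdfs:///data/table1", "parquet")
t.count                              // 1600

// ... a new folder /data/table1/key=3 is added outside Spark SQL ...

sqlContext.refreshTable("table1")    // pick up the new partition
sqlContext.table("table1").count     // 2400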

Cheng

On 4/4/15 5:24 PM, Rex Xiong wrote:

Hi Spark Users,

I'm testing the new Parquet partition discovery feature in 1.3.
I have 2 sub-folders, each with 800 rows.
/data/table1/key=1
/data/table1/key=2

In spark-shell, run this command:

val t = sqlContext.createExternalTable("table1", 
"hdfs:///data/table1", "parquet")


t.count


It shows 1600 successfully.

But after that, I add a new folder /data/table1/key=3, then run 
t.count again, it still gives me 1600, not 2400.



I try to restart spark-shell, then run

val t = sqlContext.table("table1")

t.count


It's 2400 now.


I'm wondering whether there is a partition cache in the driver. I tried
setting spark.sql.parquet.cacheMetadata to false and testing it
again, but unfortunately it doesn't help.



How can I disable this partition cache or force refresh the cache?


Thanks





Re: Parquet Hive table become very slow on 1.3?

2015-04-04 Thread Cheng Lian

Hey Xudong,

We have been digging into this issue for a while, and believe PR 5339
and PR 5334 should fix this issue.


There are two problems:

1. Normally we cache Parquet table metadata for better performance, but
when converting Hive metastore Parquet tables, the cache is not used. Thus
heavy operations like schema discovery are done every time a metastore
Parquet table is converted.
2. With Parquet task side metadata reading (which is turned on by 
default), we can actually skip the row group information in the footer. 
However, we accidentally called a Parquet function which doesn't skip 
row group information.


For your question about schema merging, Parquet allows different
part-files to have different but compatible schemas. For example,
part-1.parquet has columns a and b, while part-2.parquet may have
columns a and c. In some cases, the summary files (_metadata and
_common_metadata) contain the merged schema (a, b, and c), but this is not
guaranteed. For example, when the user-defined metadata stored in different
part-files contains different values for the same key, Parquet simply
gives up writing summary files. That's why all part-files must be
touched to get a precise merged schema.


However, in scenarios where a centralized authoritative schema is
available (e.g. the Hive metastore schema, or the schema provided by the user
via data source DDL), we don't need to do schema merging on the driver side,
but can defer it to the executor side, where each task only needs to reconcile
those part-files it actually touches. This is also what the Parquet
developers did recently for parquet-hadoop.


Cheng

On 3/31/15 11:49 PM, Zheng, Xudong wrote:

Thanks Cheng!

Setting 'spark.sql.parquet.useDataSourceApi' to false resolves my issues,
but PR 5231 does not seem to. Not sure what else I did wrong ...


BTW, actually, we are very interested in the schema merging feature in
Spark 1.3, so both of these two solutions will disable this feature,
right? It seems that the Parquet metadata is stored in a file named
_metadata in the Parquet file folder (each folder is a partition as we
use partitioned tables), so why do we need to scan all Parquet part-files? Is
there any other solution that could keep the schema merging feature at the
same time? We really like this feature :)


On Tue, Mar 31, 2015 at 3:19 PM, Cheng Lian > wrote:


Hi Xudong,

This is probably because of Parquet schema merging is turned on by
default. This is generally useful for Parquet files with different
but compatible schemas. But it needs to read metadata from all
Parquet part-files. This can be problematic when reading Parquet
files with lots of part-files, especially when the user doesn't
need schema merging.

This issue is tracked by SPARK-6575, and here is a PR for it:
https://github.com/apache/spark/pull/5231. This PR adds a
configuration to disable schema merging by default when doing Hive
metastore Parquet table conversion.

Another workaround is to fallback to the old Parquet code by
setting spark.sql.parquet.useDataSourceApi to false.

Cheng


On 3/31/15 2:47 PM, Zheng, Xudong wrote:

Hi all,

We are using Parquet Hive table, and we are upgrading to Spark
1.3. But we find that, just a simple COUNT(*) query will much
slower (100x) than Spark 1.2.

I find the most time spent on driver to get HDFS blocks. I find
large amount of get below logs printed:

15/03/30 23:03:43 DEBUG ProtobufRpcEngine: Call: getBlockLocations took 2097ms
15/03/30 23:03:43 DEBUG DFSClient: newInfo = LocatedBlocks{
  fileLength=77153436
  underConstruction=false
  blocks=[LocatedBlock{BP-1236294426-10.152.90.181-1425290838173:blk_1075187948_1448275; getBlockSize()=77153436; corrupt=false; offset=0; locs=[10.152.116.172:50010, 10.152.116.169:50010, 10.153.125.184:50010]}]
  lastLocatedBlock=LocatedBlock{BP-1236294426-10.152.90.181-1425290838173:blk_1075187948_1448275; getBlockSize()=77153436; corrupt=false; offset=0; locs=[10.152.116.169:50010, 10.153.125.184:50010, 10.152.116.172:50010]}
  isLastBlockComplete=true}
15/03/30 23:03:43 DEBUG DFSClient: Connecting to datanode 10.152.116.172:50010


I compared the printed log with Spark 1.2: although the number of
getBlockLocations calls is similar, each such operation only
cost 20~30 ms there (it is 2000ms~3000ms now), and it didn't print
the detailed LocatedBlocks info.

Another finding is that if I read the Parquet file via Scala code
from spark-shell as below, it looks fine; the computation
returns the result as quickly as before.

sqlContext.par

Re: conversion from java collection type to scala JavaRDD

2015-04-04 Thread Dean Wampler
Without the rest of your code, it's hard to know what might be
unserializable.

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
 (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Sat, Apr 4, 2015 at 7:56 AM, Jeetendra Gangele 
wrote:

>
> Hi, I have tried parallelize but I got the exception below:
>
> java.io.NotSerializableException: pacific.dr.VendorRecord
>
> Here is my code:
>
> List<VendorRecord> vendorRecords =
> blockingKeys.getMatchingRecordsWithscan(matchKeysOutput);
> JavaRDD<VendorRecord> lines = sc.parallelize(vendorRecords);
>
>
> On 2 April 2015 at 21:11, Dean Wampler  wrote:
>
>> Use JavaSparkContext.parallelize.
>>
>>
>> http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaSparkContext.html#parallelize(java.util.List)
>>
>> Dean Wampler, Ph.D.
>> Author: Programming Scala, 2nd Edition
>>  (O'Reilly)
>> Typesafe 
>> @deanwampler 
>> http://polyglotprogramming.com
>>
>> On Thu, Apr 2, 2015 at 11:33 AM, Jeetendra Gangele 
>> wrote:
>>
>>> Hi All
>>> Is there a way to make a JavaRDD from an existing Java
>>> collection type List?
>>> I know this can be done using Scala, but I am looking at how to do this
>>> using Java.
>>>
>>>
>>> Regards
>>> Jeetendra
>>>
>>
>>
>
>
>


Need help with ALS Recommendation code

2015-04-04 Thread Phani Yadavilli -X (pyadavil)
Hi ,

I am trying to run the following command in the Movie Recommendation example 
provided by the ampcamp tutorial

Command:   sbt package "run /movielens/medium"

Exception: sbt.TrapExitSecurityException thrown from the 
UncaughtExceptionHandler in thread "run-main-0"
java.lang.RuntimeException: Nonzero exit code: 1
at scala.sys.package$.error(package.scala:27)
[trace] Stack trace suppressed: run last compile:run for the full output.
[error] (compile:run) Nonzero exit code: 1
[error] Total time: 0 s, completed Apr 4, 2015 12:00:18 PM

I am unable to identify the error code. Can someone help me with this?

Regards
Phani Kumar


Re: conversion from java collection type to scala JavaRDD

2015-04-04 Thread Jeetendra Gangele
Hi, I have tried parallelize but I got the exception below:

java.io.NotSerializableException: pacific.dr.VendorRecord

Here is my code:

List<VendorRecord> vendorRecords =
blockingKeys.getMatchingRecordsWithscan(matchKeysOutput);
JavaRDD<VendorRecord> lines = sc.parallelize(vendorRecords);


On 2 April 2015 at 21:11, Dean Wampler  wrote:

> Use JavaSparkContext.parallelize.
>
>
> http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaSparkContext.html#parallelize(java.util.List)
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
>  (O'Reilly)
> Typesafe 
> @deanwampler 
> http://polyglotprogramming.com
>
> On Thu, Apr 2, 2015 at 11:33 AM, Jeetendra Gangele 
> wrote:
>
>> Hi All
>> Is there a way to make a JavaRDD from an existing Java collection
>> type List?
>> I know this can be done using Scala, but I am looking at how to do this
>> using Java.
>>
>>
>> Regards
>> Jeetendra
>>
>
>


Re: Spark Vs MR

2015-04-04 Thread Sean Owen
If data is on HDFS, it is not read any more or less quickly by either
framework. Both are in fact using the same logic to exploit locality,
and read and deserialize data anyway. I don't think this is what
anyone claims though.

Spark can be faster in a multi-stage operation, which would require
several MRs. The MRs must hit disk again after the reducer whereas
Spark might not, possibly by persisting outputs in memory. A similar
but larger speedup can be had for iterative computations that access
the same data in memory; caching it means reading it from disk once,
but then re-reading from memory only.

For a single operation that really is a map and a reduce, starting and
ending on HDFS, I would expect MR to be a bit faster just because it
is so optimized for this one pattern. Even that depends a lot, and
wouldn't be significant.
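
As a small illustration of the iterative case (the HDFS path below is made
up), the first pass reads from disk and later passes hit the cached in-memory
copy:

val nums = sc.textFile("hdfs:///data/numbers.txt").map(_.toDouble).cache()

// Ten passes over the same data; only the first one touches HDFS.
val results = (1 to 10).map(i => nums.map(_ * i).sum())
println(results.mkString(", "))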


On Sat, Apr 4, 2015 at 11:19 AM, SamyaMaiti  wrote:
> How is Spark faster than MR when data is on disk in both cases?
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Vs-MR-tp22373.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: 4 seconds to count 13M lines. Does it make sense?

2015-04-04 Thread SamyaMaiti
Reduce spark.sql.shuffle.partitions from the default of 200 to the total
number of cores.
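
For example (the value 8 below just stands in for your actual total core count):

sqlContext.setConf("spark.sql.shuffle.partitions", "8")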



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/4-seconds-to-count-13M-lines-Does-it-make-sense-tp22360p22374.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Vs MR

2015-04-04 Thread SamyaMaiti
How is Spark faster than MR when data is on disk in both cases?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Vs-MR-tp22373.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Issue of sqlContext.createExternalTable with parquet partition discovery after changing folder structure

2015-04-04 Thread Rex Xiong
Hi Spark Users,

I'm testing the new Parquet partition discovery feature in 1.3.
I have 2 sub-folders, each with 800 rows.
/data/table1/key=1
/data/table1/key=2

In spark-shell, run this command:

val t = sqlContext.createExternalTable("table1", "hdfs:///data/table1",
"parquet")

t.count


It shows 1600 successfully.

But after that, I add a new folder /data/table1/key=3, then run t.count
again, it still gives me 1600, not 2400.


I try to restart spark-shell, then run

val t = sqlContext.table("table1")

t.count


It's 2400 now.


I'm wondering whether there is a partition cache in the driver. I tried
setting spark.sql.parquet.cacheMetadata
to false and testing again, but unfortunately it doesn't help.


How can I disable this partition cache or force refresh the cache?


Thanks


Re: Migrating from Spark 0.8.0 to Spark 1.3.0

2015-04-04 Thread Nick Pentreath
It shouldn't be too bad - pertinent changes & migration notes are here: 
http://spark.apache.org/docs/1.0.0/programming-guide.html#migrating-from-pre-10-versions-of-spark
 for pre-1.0 and here: 
http://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-10-12-to-13
 for SparkSQL pre-1.3




Since you aren't using SparkSQL the 2nd link is probably not useful. 




Generally you should find very few changes in the core API but things like 
MLlib would have changed a fair bit - though again the API should have been 
relatively stable.




Your biggest change is probably going to be running jobs through spark-submit 
rather than spark-class etc: 
http://spark.apache.org/docs/latest/submitting-applications.html
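
A minimal spark-submit invocation looks roughly like this (the class name,
master URL and jar path are placeholders):

./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://your-master:7077 \
  target/scala-2.10/my-app_2.10-1.0.jar arg1 arg2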










—
Sent from Mailbox

On Sat, Apr 4, 2015 at 1:11 AM, Ritesh Kumar Singh
 wrote:

> Hi,
> Are there any tutorials that explain all the changelogs between Spark
> 0.8.0 and Spark 1.3.0, and how we can approach this issue?

Re: spark mesos deployment : starting workers based on attributes

2015-04-04 Thread Ankur Chauhan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi,

Created issue: https://issues.apache.org/jira/browse/SPARK-6707
I would really appreciate ideas/views/opinions on this feature.

- -- Ankur Chauhan

On 03/04/2015 13:23, Tim Chen wrote:
> Hi Ankur,
> 
> There isn't a way to do that yet, but it's simple to add.
> 
> Can you create a JIRA in Spark for this?
> 
> Thanks!
> 
> Tim
> 
> On Fri, Apr 3, 2015 at 1:08 PM, Ankur Chauhan
> mailto:achau...@brightcove.com>> wrote:
> 
> Hi,
> 
> I am trying to figure out if there is a way to tell the mesos 
> scheduler in spark to isolate the workers to a set of mesos slaves 
> that have a given attribute such as `tachyon:true`.
> 
> Anyone knows if that is possible or how I could achieve such a
> behavior.
> 
> Thanks! -- Ankur Chauhan
> 
> -
>
> 
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>  For additional commands,
> e-mail: user-h...@spark.apache.org 
> 
> 
> 

- -- 
- -- Ankur Chauhan
-BEGIN PGP SIGNATURE-

iQEcBAEBAgAGBQJVH4xBAAoJEOSJAMhvLp3LMfsH/0oyQ4fGomCd8GnQzqVrZ6zj
cgwhOyntz5aaBdjipVez1EwzNzG/3kXzFnK3YpuT6SXdXuPLSD6NX62ju/Ii+86w
/Y15taXt1qo+Ah6CLkofCPAPY1HRCZ+KAM/KzW45W+uGvcUqyupPFUEvN/a9/hYC
Ok7AERk8Tw/CRoU/Fbz/23LxjK1TJUW1klaToUjyij2oakMUxT7HnqS08fCUBJF6
pEqXJ+gHGW3br6BJcvwce7my8bFlPShVP+exhcNhpmqjoRvSf//etmP2E0Me2hXM
ZmghjIqRhoAI4sJYIhEBBQS7r4AsI5FQyNkr8i4Hqed4dq61YA7FcpUCC+GDbTY=
=pVkB
-END PGP SIGNATURE-

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org