Python - SQL (geonames dataset)
I'm using Python to set up a DataFrame, but for some reason it is not being made available to SQL. Code (from Zeppelin) below. I don't get any error when loading/prepping the data or the DataFrame. Any tips? (Originally I was not hardcoding the Row() structure, as my other tutorial added it by default; not sure why it didn't work here, but that might be beside the point.) Any guesses greatly appreciated as I dig my teeth in here for the first time. Thanks!

---
%pyspark
from pyspark.sql.types import Row, StructType, StructField, IntegerType, StringType, DecimalType
from os import getcwd

sqlContext = SQLContext(sc)
datafile = sc.textFile("/Users/tyler/data/geonames/CA.txt")
geonames = datafile.map(lambda s: s.split("\t")).map(lambda s: Row(
    geonameid=int(s[0]), asciiname=str(s[2]), latitude=float(s[4]), longitude=float(s[5]),
    elevation=str(s[16]), featureclass=str(s[6]), featurecode=str(s[7]), countrycode=str(s[8])))
gndf = sqlContext.inferSchema(geonames)
gndf.registerAsTable("geonames")
#print gndf.count()
print "---"
print gndf.columns
print "---"
print gndf.first()
print "---"
gndf.schema

OUTPUT

[u'asciiname', u'countrycode', u'elevation', u'featureclass', u'featurecode', u'geonameid', u'latitude', u'longitude']
---
Row(asciiname=u'100 Mile House', countrycode=u'CA', elevation=u'928', featureclass=u'P', featurecode=u'PPL', geonameid=5881639, latitude=51.64982, longitude=-121.28594)
---
StructType(List(StructField(asciiname,StringType,true),StructField(countrycode,StringType,true),StructField(elevation,StringType,true),StructField(featureclass,StringType,true),StructField(featurecode,StringType,true),StructField(geonameid,LongType,true),StructField(latitude,DoubleType,true),StructField(longitude,DoubleType,true)))

==
%sql
SELECT geonameid, count(1) value FROM geonames LIMIT 1

no such table List(geonames); line 2 pos 5
Re: Is it possible to set the akka specify properties (akka.extensions) in spark
Try SparkConf.set("spark.akka.extensions", "Whatever"); underneath, I think Spark won't ship properties which don't start with spark.* to the executors. Thanks Best Regards On Mon, May 11, 2015 at 8:33 AM, Terry Hole hujie.ea...@gmail.com wrote: Hi all, I'd like to monitor Akka using Kamon, which needs the akka.extensions setting to be a list like this in Typesafe config format: akka { extensions = ["kamon.system.SystemMetrics", "kamon.statsd.StatsD"] } But I can not find a way to do this; I have tried these: 1. SparkConf.set("akka.extensions", "[kamon.system.SystemMetrics, kamon.statsd.StatsD]") 2. use application.conf and set it via the java option -Dconfig.resource=/path/to/conf 3. Set akka.extensions [kamon.system.SystemMetrics, kamon.statsd.StatsD] in the spark conf file None of these work. Do we have other ways to set this? Thanks!
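For reference, a minimal PySpark sketch of what the suggested workaround looks like in code (prefixing the property with spark. so Spark will ship it); whether the Akka layer actually picks the value up this way is exactly what is in question in this thread, and the value shown is illustrative:

    from pyspark import SparkConf, SparkContext

    # Sketch of the suggestion above: properties must start with "spark."
    # for Spark to ship them to the executors.
    conf = (SparkConf()
            .setAppName("kamon-metrics-test")
            .set("spark.akka.extensions",
                 "kamon.system.SystemMetrics,kamon.statsd.StatsD"))
    sc = SparkContext(conf=conf)

(As the follow-up later in this thread notes, this particular setting did not resolve the problem for the original poster.)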
Re: Cassandra number of Tasks
Did you try repartitioning? You might end up with a lot of time spent on GC though. Thanks Best Regards On Fri, May 8, 2015 at 11:59 PM, Vijay Pawnarkar vijaypawnar...@gmail.com wrote: I am using the Spark Cassandra connector to work with a table with 3 million records, using the .where() API to work with only certain rows in this table. The where clause filters the data down to 10,000 rows. CassandraJavaUtil.javaFunctions(sparkContext) .cassandraTable(KEY_SPACE, MY_TABLE, CassandraJavaUtil.mapRowTo(MyClass.class)).where(cqlDataFilter, cqlFilterParams) I am also using the parameter spark.cassandra.input.split.size=1000. As this job is processed by the Spark cluster, it creates 3000 partitions instead of 10, and 3000 tasks are executed on the cluster. As the data in our table grows to 30 million rows, this will create 30,000 tasks instead of 10. Is there a better way to process these 10,000 records with 10 tasks? Thanks!
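As an illustration of the repartitioning suggestion, a minimal PySpark sketch on a generic RDD (the connector call in the original post is Java/Scala, so the RDD contents and target partition count here are placeholders):

    # Hypothetical example: collapse a heavily over-partitioned RDD down to
    # 10 partitions. coalesce() avoids a full shuffle when only reducing the
    # partition count; repartition() forces a shuffle but rebalances the data.
    filtered = sc.parallelize(range(10000), 3000)   # stand-in for the filtered table
    small = filtered.coalesce(10)
    balanced = filtered.repartition(10)
    print small.getNumPartitions()
    print balanced.getNumPartitions()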
Re: Python - SQL (geonames dataset)
Try this Res = ssc.sql(your SQL without limit) Print red.first() Note: your SQL looks wrong as count will need a group by clause. Best Ayan On 11 May 2015 16:22, Tyler Mitchell tyler.mitch...@actian.com wrote: I'm using Python to setup a dataframe, but for some reason it is not being made available to SQL. Code (from Zeppelin) below. I don't get any error when loading/prepping the data or dataframe. Any tips? (Originally I was not hardcoding the Row() structure, as my other tutorial added it by default, not sure why it didn't work here, but that might be besides the point.) Any guesses greatly appreciated as I dig my teeth in here for the first time. Thanks! --- %pyspark from pyspark.sql.types import Row, StructType, StructField, IntegerType, StringType, DecimalType from os import getcwd sqlContext = SQLContext(sc) datafile = sc.textFile(/Users/tyler/data/geonames/CA.txt) geonames = datafile.map(lambda s: s.split(\t)).map(lambda s: Row( geonameid=int(s[0]), asciiname=str(s[2]), latitude=float(s[4]), longitude=float(s[5]), elevation=str(s[16]), featureclass=str(s[6]), featurecode=str(s[7]), countrycode=str(s[8]) )) gndf = sqlContext.inferSchema(geonames) gndf.registerAsTable(geonames) #print gndf.count() print --- print gndf.columns print --- print gndf.first() print --- gndf.schema OUTPUT [u'asciiname', u'countrycode', u'elevation', u'featureclass', u'featurecode', u'geonameid', u'latitude', u'longitude'] --- Row(asciiname=u'100 Mile House', countrycode=u'CA', elevation=u'928', featureclass=u'P', featurecode=u'PPL', geonameid=5881639, latitude=51.64982, longitude=-121.28594) --- StructType(List(StructField(asciiname,StringType,true),StructField(countrycode,StringType,true),StructField(elevation,StringType,true),StructField(featureclass,StringType,true),StructField(featurecode,StringType,true),StructField(geonameid,LongType,true),StructField(latitude,DoubleType,true),StructField(longitude,DoubleType,true))) = %sql SELECT geonameid, count(1) value FROM geonames LIMIT 1 no such table List(geonames); line 2 pos 5
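To illustrate the note about the aggregate needing a GROUP BY, a hedged sketch of the suggestion in PySpark (the grouping column is just an example from the dataset):

    # Run the aggregate through the Python API first to confirm the table is
    # visible; count(1) needs a GROUP BY when another column is selected.
    res = sqlContext.sql(
        "SELECT featureclass, count(1) AS value "
        "FROM geonames GROUP BY featureclass")
    print res.first()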
Re: EVent generation
Have a look over here https://storm.apache.org/community.html Thanks Best Regards On Sun, May 10, 2015 at 3:21 PM, anshu shukla anshushuk...@gmail.com wrote: http://stackoverflow.com/questions/30149868/generate-events-tuples-using-csv-file-with-timestamps -- Thanks Regards, Anshu Shukla
Re: Spark can not access jar from HDFS !!
Hi All, Thanks for suggestions. What I tried is - hiveContext.sql (add jar ) and that helps to complete the create temporary function but while using this function I get ClassNotFound for the class handling this function. The same class is present in the jar added . Please note that the same works fine from the Hive Shell. Is there an issue with Spark while distributing jars across workers? May be that is causing the problem. Also can you please suggest the manual way of copying the jars to the workers, I just want to ascertain my assumption. Thanks, Ravi On Sun, May 10, 2015 at 1:40 AM Michael Armbrust mich...@databricks.com wrote: That code path is entirely delegated to hive. Does hive support this? You might try instead using sparkContext.addJar. On Sat, May 9, 2015 at 12:32 PM, Ravindra ravindra.baj...@gmail.com wrote: Hi All, I am trying to create custom udfs with hiveContext as given below - scala hiveContext.sql (CREATE TEMPORARY FUNCTION sample_to_upper AS 'com.abc.api.udf.MyUpper' USING JAR 'hdfs:///users/ravindra/customUDF2.jar') I have put the udf jar in the hdfs at the path given above. The same command works well in the hive shell but failing here in the spark shell. And it fails as given below. - 15/05/10 00:41:51 ERROR Task: FAILED: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load JAR hdfs:///users/ravindra/customUDF2.jar 15/05/10 00:41:51 INFO FunctionTask: create function: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load JAR hdfs:///users/ravindra/customUDF2.jar at org.apache.hadoop.hive.ql.exec.FunctionTask.addFunctionResources(FunctionTask.java:305) at org.apache.hadoop.hive.ql.exec.FunctionTask.createTemporaryFunction(FunctionTask.java:179) at org.apache.hadoop.hive.ql.exec.FunctionTask.execute(FunctionTask.java:81) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:305) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276) at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35) at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35) at org.apache.spark.sql.execution.Command$class.execute(commands.scala:46) at org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:108) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94) at $line17.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:18) at $line17.$read$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:23) at $line17.$read$$iwC$$iwC$$iwC$$iwC.init(console:25) at $line17.$read$$iwC$$iwC$$iwC.init(console:27) at $line17.$read$$iwC$$iwC.init(console:29) at $line17.$read$$iwC.init(console:31) at $line17.$read.init(console:33) at $line17.$read$.init(console:37) at $line17.$read$.clinit(console) at 
$line17.$eval$.init(console:7) at $line17.$eval$.clinit(console) at $line17.$eval.$print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:636) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:641) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:968) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916) at
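For reference, a PySpark sketch of the sequence being described (adding the jar and creating the temporary function through HiveContext); the jar path and class name are the ones quoted above, while the table and column used with the UDF are hypothetical, and this only illustrates the approach rather than a confirmed fix:

    from pyspark.sql import HiveContext

    hc = HiveContext(sc)
    # Same steps as in the Scala shell above: register the jar, then the UDF.
    hc.sql("ADD JAR hdfs:///users/ravindra/customUDF2.jar")
    hc.sql("CREATE TEMPORARY FUNCTION sample_to_upper AS 'com.abc.api.udf.MyUpper'")
    # Using the function is where the ClassNotFoundException reportedly appears,
    # which suggests the jar is not reaching the executors.
    hc.sql("SELECT sample_to_upper(name) FROM some_table").show()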
[SparkSQL 1.4.0] groupBy columns are always nullable?
I am trying to get the result schema of aggregate functions using the DataFrame API. However, I find that the result fields for the groupBy columns are always nullable, even when the source field is not nullable. I want to know if this is by design, thank you! Below is simple code to show the issue.

==
import sqlContext.implicits._
import org.apache.spark.sql.functions._

case class Test(key: String, value: Long)
val df = sc.makeRDD(Seq(Test("k1", 2), Test("k1", 1))).toDF
val result = df.groupBy("key").agg($"key", sum("value"))
// From the output, you can see the key column is nullable, why??
result.printSchema
//root
// |-- key: string (nullable = true)
// |-- SUM(value): long (nullable = true)
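A PySpark sketch of the same check, in case it helps reproduce the observation (whether the nullable flag survives createDataFrame exactly as declared can vary by version, so treat this as illustrative):

    from pyspark.sql.types import StructType, StructField, StringType, LongType
    from pyspark.sql import functions as F

    schema = StructType([
        StructField("key", StringType(), nullable=False),
        StructField("value", LongType(), nullable=False),
    ])
    df = sqlContext.createDataFrame([("k1", 2), ("k1", 1)], schema)
    df.printSchema()          # key declared as nullable = false here
    result = df.groupBy("key").agg(F.sum("value"))
    result.printSchema()      # key reportedly comes back as nullable = true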
Re: Is it possible to set the akka specify properties (akka.extensions) in spark
Hi Akhil, I tried this; it did not work. I also tried SparkConf.set("akka.extensions", "[\"kamon.system.SystemMetrics\", \"kamon.statsd.StatsD\"]"), and it also did not work. Thanks On Mon, May 11, 2015 at 2:56 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Try SparkConf.set("spark.akka.extensions", "Whatever"); underneath, I think Spark won't ship properties which don't start with spark.* to the executors. Thanks Best Regards On Mon, May 11, 2015 at 8:33 AM, Terry Hole hujie.ea...@gmail.com wrote: Hi all, I'd like to monitor Akka using Kamon, which needs the akka.extensions setting to be a list like this in Typesafe config format: akka { extensions = ["kamon.system.SystemMetrics", "kamon.statsd.StatsD"] } But I can not find a way to do this; I have tried these: 1. SparkConf.set("akka.extensions", "[kamon.system.SystemMetrics, kamon.statsd.StatsD]") 2. use application.conf and set it via the java option -Dconfig.resource=/path/to/conf 3. Set akka.extensions [kamon.system.SystemMetrics, kamon.statsd.StatsD] in the spark conf file None of these work. Do we have other ways to set this? Thanks!
Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue
This is the stack trace of the worker thread: org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:150) org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:130) org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:60) org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) org.apache.spark.rdd.RDD.iterator(RDD.scala:244) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) org.apache.spark.rdd.RDD.iterator(RDD.scala:244) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) org.apache.spark.rdd.RDD.iterator(RDD.scala:244) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70) org.apache.spark.rdd.RDD.iterator(RDD.scala:242) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) org.apache.spark.scheduler.Task.run(Task.scala:64) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) On 8 May 2015 at 22:12, Josh Rosen rosenvi...@gmail.com wrote: Do you have any more specific profiling data that you can share? I'm curious to know where AppendOnlyMap.changeValue is being called from. On Fri, May 8, 2015 at 1:26 PM, Michal Haris michal.ha...@visualdna.com wrote: +dev On 6 May 2015 10:45, Michal Haris michal.ha...@visualdna.com wrote: Just wanted to check if somebody has seen similar behaviour or knows what we might be doing wrong. We have a relatively complex spark application which processes half a terabyte of data at various stages. We have profiled it in several ways and everything seems to point to one place where 90% of the time is spent: AppendOnlyMap.changeValue. The job scales and is relatively faster than its map-reduce alternative but it still feels slower than it should be. I am suspecting too much spill but I haven't seen any improvement by increasing number of partitions to 10k. Any idea would be appreciated. -- Michal Haris Technical Architect direct line: +44 (0) 207 749 0229 www.visualdna.com | t: +44 (0) 207 734 7033, -- Michal Haris Technical Architect direct line: +44 (0) 207 749 0229 www.visualdna.com | t: +44 (0) 207 734 7033,
how to load some of the files in a dir and monitor new file in that dir in spark streaming without missing?
I have one HDFS dir which contains many files: /user/root/1.txt /user/root/2.txt /user/root/3.txt /user/root/4.txt and there is a daemon process which adds one file per minute to this dir (e.g., 5.txt, 6.txt, 7.txt...). I want to start a Spark Streaming job which loads 3.txt and 4.txt and then detects all the new files after 4.txt. Please note that because these files are large, processing them will take a long time. So if I process 3.txt and 4.txt before launching the streaming task, 5.txt and 6.txt may be produced into this dir while 3.txt and 4.txt are still being processed. And when the streaming task starts, 5.txt and 6.txt will be missed, because it will only process new files (from 7.txt onward). I'm not sure if I have described the problem clearly; if you have any questions, please ask me. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-load-some-of-the-files-in-a-dir-and-monitor-new-file-in-that-dir-in-spark-streaming-without-m-tp22841.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
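For illustration only, one possible arrangement in PySpark: start the textFileStream watching the directory first, then load the existing backlog as a plain RDD on the same SparkContext. This is a sketch of the setup described, not a complete answer to the ordering problem raised above; process() is a hypothetical routine.

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, batchDuration=60)

    # Watch the directory for files created after the streaming job starts.
    new_files = ssc.textFileStream("hdfs:///user/root/")
    new_files.foreachRDD(lambda rdd: process(rdd))   # process() is hypothetical

    ssc.start()

    # Handle the pre-existing backlog separately, on the same SparkContext,
    # once the stream is already watching for new files.
    backlog = sc.textFile("/user/root/3.txt,/user/root/4.txt")
    process(backlog)

    ssc.awaitTermination()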
spark mllib kmeans
Hi, is it possible to use a custom distance measure and another data type as the vector? I want to cluster temporal geo data. Best regards, Paul
RE: Spark SQL and java.lang.RuntimeException
Hi, Are you creating the table from hive? Which version of hive are you using? Thanks, Daoyuan -Original Message- From: Nick Travers [mailto:n.e.trav...@gmail.com] Sent: Sunday, May 10, 2015 10:34 AM To: user@spark.apache.org Subject: Spark SQL and java.lang.RuntimeException I'm getting the following error when reading a table from Hive. Note the spelling of the 'Primitve' in the stack trace. I can't seem to find it anywhere else online. It seems to only occur with this one particular table I am reading from. Occasionally the task will completely fail, other times it will not. I run into different variants of the exception, presumably for each of the different types of the columns (LONG, INT, BOOLEAN). Has anyone else run into this issue? I'm running Spark 1.3.0 with the standalone cluster manager. java.lang.RuntimeException: Primitve type LONG should not take parameters at org.apache.hadoop.hive.serde2.lazy.objectinspector.primitive.LazyPrimitiveObjectInspectorFactory.getLazyObjectInspector(LazyPrimitiveObjectInspectorFactory.java:136) at org.apache.hadoop.hive.serde2.lazy.objectinspector.primitive.LazyPrimitiveObjectInspectorFactory.getLazyObjectInspector(LazyPrimitiveObjectInspectorFactory.java:113) at org.apache.hadoop.hive.serde2.lazy.LazyFactory.createLazyObjectInspector(LazyFactory.java:224) at org.apache.hadoop.hive.serde2.lazy.LazyFactory.createColumnarStructInspector(LazyFactory.java:314) at org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe.initialize(ColumnarSerDe.java:88) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:118) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:115) at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634) at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-and-java-lang-RuntimeException-tp22831.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark mllib kmeans
Hi Paul, I would say that it should be possible, but you'll need a different distance measure which conforms to your coordinate system. 2015-05-11 14:59 GMT+02:00 Pa Rö paul.roewer1...@googlemail.com: hi, it is possible to use a custom distance measure and a other data typ as vector? i want cluster temporal geo datas. best regards paul
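To make the "other data type as vector" part concrete, a minimal PySpark sketch that packs temporal geo records into MLlib vectors. Note that MLlib's KMeans itself uses Euclidean distance, so a genuinely custom distance measure would need feature scaling tricks or a custom implementation; the records and scaling below are purely illustrative.

    from pyspark.mllib.clustering import KMeans
    from pyspark.mllib.linalg import Vectors

    # Hypothetical records: (latitude, longitude, unix_timestamp)
    records = sc.parallelize([
        (48.55, -125.02, 1431302400),
        (51.64, -121.28, 1431306000),
    ])
    # Features must share a comparable scale for Euclidean distance to be
    # meaningful, hence the crude rescaling of the timestamp below.
    vectors = records.map(lambda r: Vectors.dense(r[0], r[1], r[2] / 1e6))
    model = KMeans.train(vectors, k=2, maxIterations=10)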
Reading Nested Fields in DataFrames
Hi , I am trying to read Nested Avro data in Spark 1.3 using DataFrames. I need help to retrieve the Inner element data in the Structure below. Below is the schema when I enter df.printSchema : |-- THROTTLING_PERCENTAGE: double (nullable = false) |-- IMPRESSION_TYPE: string (nullable = false) |-- campaignArray: array (nullable = false) ||-- element: struct (containsNull = false) |||-- COOKIE: string (nullable = false) |||-- CAMPAIGN_ID: long (nullable = false) How can I access CAMPAIGN_ID field in this schema ? Thanks, Ashish Kr. Singh
It takes too long (30 seconds) to create Spark Context with SPARK/YARN
I am running Spark jobs on a YARN cluster. It takes ~30 seconds to create a Spark context, while it takes only 1-2 seconds running Spark in local mode. The master is set as yarn-client, and both the machine that submits the Spark job and the YARN cluster are in the same domain. Originally I suspected that the following config might play a role, especially spark.scheduler.maxRegisteredResourcesWaitingTime, whose default is 30 seconds. However, lowering it to 10 makes no difference. Modifications to the other two parameters also have no effect. spark.scheduler.maxRegisteredResourcesWaitingTime=1000 (lowered from 30000 default) spark.yarn.applicationMaster.waitTries=1 (lowered from 10 default) spark.yarn.scheduler.heartbeat.interval-ms=1000 (lowered from 5000 default) What could be the possible reason? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/It-takes-too-long-30-seconds-to-create-Spark-Context-with-SPARK-YARN-tp22847.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Python - SQL can't find table (Zeppelin)
Thanks for the suggestion Ayan, it has not solved my problem but I did get sqlContext to execute the SQL and return dataframe object. SQL is running fine in the pyspark interpreter but not passing to SQL note (though it works fine for a different dataset) - guess I'll take this question to the Zeppelin list. Thanks again, more tips welcome if anyone sees anything funny... %pyspark from pyspark.sql.types import Row, StructType, StructField, IntegerType, StringType, DecimalType from os import getcwd sqlContext = SQLContext(sc) datafile = sc.textFile(/Users/mitty01/data/geonames/CA.txt) geonames = datafile.map(lambda s: s.split(\t)).map(lambda s: Row( geonameid=int(s[0]), asciiname=str(s[2]), latitude=float(s[4]), longitude=float(s[5]), elevation=str(s[16]), featureclass=str(s[6]), featurecode=str(s[7]), countrycode=str(s[8]) )) gndf = sqlContext.inferSchema(geonames) gndf.registerTempTable('geonames') #print gndf.count() print --- print gndf.columns print --- print gndf.first() print --- gndf.schema print --- sqlContext.sql(SELECT * FROM geonames LIMIT 10) OUTPUT --- [u'asciiname', u'countrycode', u'elevation', u'featureclass', u'featurecode', u'geonameid', u'latitude', u'longitude'] --- Row(asciiname=u'Swiftsure Bank', countrycode=u'CA', elevation=u'-', featureclass=u'U', featurecode=u'BNKU', geonameid=4030308, latitude=48.55321, longitude=-125.02235) --- StructType(List(StructField(asciiname,StringType,true),StructField(countrycode,StringType,true),StructField(elevation,StringType,true),StructField(featureclass,StringType,true),StructField(featurecode,StringType,true),StructField(geonameid,LongType,true),StructField(latitude,DoubleType,true),StructField(longitude,DoubleType,true))) --- DataFrame[asciiname: string, countrycode: string, elevation: string, featureclass: string, featurecode: string, geonameid: bigint, latitude: double, longitude: double] == %sql SELECT * FROM geonames LIMIT 1 no such table List(geonames); line 2 pos 5 From: ayan guha guha.a...@gmail.com Sent: May 11, 2015 12:27 AM To: Tyler Mitchell Cc: user Subject: Re: Python - SQL (geonames dataset) Try this Res = ssc.sql(your SQL without limit) Print red.first() Note: your SQL looks wrong as count will need a group by clause. Best Ayan On 11 May 2015 16:22, Tyler Mitchell tyler.mitch...@actian.commailto:tyler.mitch...@actian.com wrote: I'm using Python to setup a dataframe, but for some reason it is not being made available to SQL. Code (from Zeppelin) below. I don't get any error when loading/prepping the data or dataframe. Any tips? (Originally I was not hardcoding the Row() structure, as my other tutorial added it by default, not sure why it didn't work here, but that might be besides the point.) Any guesses greatly appreciated as I dig my teeth in here for the first time. Thanks! 
--- %pyspark from pyspark.sql.types import Row, StructType, StructField, IntegerType, StringType, DecimalType from os import getcwd sqlContext = SQLContext(sc) datafile = sc.textFile(/Users/tyler/data/geonames/CA.txt) geonames = datafile.map(lambda s: s.split(\t)).map(lambda s: Row( geonameid=int(s[0]), asciiname=str(s[2]), latitude=float(s[4]), longitude=float(s[5]), elevation=str(s[16]), featureclass=str(s[6]), featurecode=str(s[7]), countrycode=str(s[8]) )) gndf = sqlContext.inferSchema(geonames) gndf.registerAsTable(geonames) #print gndf.count() print --- print gndf.columns print --- print gndf.first() print --- gndf.schema OUTPUT [u'asciiname', u'countrycode', u'elevation', u'featureclass', u'featurecode', u'geonameid', u'latitude', u'longitude'] --- Row(asciiname=u'100 Mile House', countrycode=u'CA', elevation=u'928', featureclass=u'P', featurecode=u'PPL', geonameid=5881639, latitude=51.64982, longitude=-121.28594) --- StructType(List(StructField(asciiname,StringType,true),StructField(countrycode,StringType,true),StructField(elevation,StringType,true),StructField(featureclass,StringType,true),StructField(featurecode,StringType,true),StructField(geonameid,LongType,true),StructField(latitude,DoubleType,true),StructField(longitude,DoubleType,true))) = %sql SELECT geonameid, count(1) value FROM geonames LIMIT 1 no such table List(geonames); line 2 pos 5
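For what it's worth, a minimal sketch of the pattern that usually makes a table visible to Zeppelin's %sql: register the DataFrame as a temp table against the sqlContext that Zeppelin itself provides, rather than constructing a new SQLContext(sc) in the paragraph. That mismatch is a likely cause of the "no such table" error, though it is not confirmed in this thread; the trimmed Row fields below are just for brevity.

    %pyspark
    from pyspark.sql.types import Row

    # Use the sqlContext injected by Zeppelin instead of SQLContext(sc),
    # so that %sql paragraphs see the same catalog of temp tables.
    datafile = sc.textFile("/Users/tyler/data/geonames/CA.txt")
    rows = datafile.map(lambda s: s.split("\t")).map(
        lambda s: Row(geonameid=int(s[0]), asciiname=str(s[2])))
    gndf = sqlContext.inferSchema(rows)
    gndf.registerTempTable("geonames")
    print sqlContext.sql("SELECT asciiname FROM geonames LIMIT 10").collect()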
Re: It takes too long (30 seconds) to create Spark Context with SPARK/YARN
Could you upload the spark assembly to HDFS and then set spark.yarn.jar to the path where you uploaded it? That can help minimize start-up time. How long if you start just a spark shell? On 5/11/15, 11:15 AM, stanley wangshua...@yahoo.com wrote: I am running Spark jobs on YARN cluster. It took ~30 seconds to create a spark context, while it takes only 1-2 seconds running Spark in local mode. The master is set as yarn-client, and both the machine that submits the Spark job and the YARN cluster are in the same domain. Originally I suspected that the following config might play a role, especially spark.scheduler.maxRegisteredResourcesWaitingTime was set to 30 seconds. However, lowering it to 10 makes no difference. Modifications to other two parameters also has no effect. spark.scheduler.maxRegisteredResourcesWaitingTime=1000 (lowered from 3 default) spark.yarn.applicationMaster.waitTries=1 (lowered from 10 default) spark.yarn.scheduler.heartbeat.interval-ms = 1000 (lowered from 5000 default) What could be the possible reason? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/It-takes-too-long-30-seconds-to-create-Spark-Context-with-SPARK-YARN-tp22847.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
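A sketch of what that suggestion looks like in configuration terms, with a placeholder HDFS path: upload the assembly once, then point spark.yarn.jar at it so each job can skip re-uploading it.

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("yarn-client")
            # Hypothetical path: wherever the Spark assembly jar was uploaded.
            .set("spark.yarn.jar",
                 "hdfs:///user/spark/share/lib/spark-assembly-1.3.1-hadoop2.5.0.jar"))
    sc = SparkContext(conf=conf)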
Does long-lived SparkContext hold on to executor resources?
I am building an analytics app with Spark. I plan to use long-lived SparkContexts to minimize the overhead of creating Spark contexts, which in turn reduces the analytics query response time. The number of queries run in the system each day is relatively small. Would long-lived contexts hold on to the executor resources when there are no queries running? Is there a way to free executor resources in this type of use case? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-long-lived-SparkContext-hold-on-to-executor-resources-tp22848.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
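One mechanism that exists for this on YARN is dynamic executor allocation, which releases idle executors and requests them again when work arrives. A hedged configuration sketch follows; the property names are standard Spark settings, the values are illustrative, and this is offered as background rather than as an answer given in this thread.

    from pyspark import SparkConf

    conf = (SparkConf()
            .set("spark.dynamicAllocation.enabled", "true")
            .set("spark.dynamicAllocation.minExecutors", "1")
            .set("spark.dynamicAllocation.maxExecutors", "20")
            # Idle executors are released after this many seconds.
            .set("spark.dynamicAllocation.executorIdleTimeout", "60")
            # Required so shuffle files survive executor removal.
            .set("spark.shuffle.service.enabled", "true"))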
can we start a new thread in foreachRDD in spark streaming?
I want to start a child thread in foreachRDD. My situation is: the job is reading from an HDFS dir continuously, and every 100 batches I want to launch a model training task (I will make a snapshot of the RDDs at that time and start the training task). The training task takes a very long time (2 hours), and I don't want it to interfere with reading new batches of data. Is starting a new child thread a good solution? Could the child thread use the SparkContext in the main thread and use the RDDs from the main thread? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/can-we-start-a-new-thread-in-foreachRDD-in-spark-streaming-tp22845.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Reading Nested Fields in DataFrames
Typically you would use the '.' notation to access it, the same way you would access a map. On 12 May 2015 00:06, Ashish Kumar Singh ashish23...@gmail.com wrote: Hi, I am trying to read nested Avro data in Spark 1.3 using DataFrames. I need help to retrieve the inner element data in the structure below. Below is the schema when I enter df.printSchema: |-- THROTTLING_PERCENTAGE: double (nullable = false) |-- IMPRESSION_TYPE: string (nullable = false) |-- campaignArray: array (nullable = false) ||-- element: struct (containsNull = false) |||-- COOKIE: string (nullable = false) |||-- CAMPAIGN_ID: long (nullable = false) How can I access the CAMPAIGN_ID field in this schema? Thanks, Ashish Kr. Singh
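A small PySpark sketch of the dot notation on the schema shown above; since campaignArray is an array of structs, selecting a nested field this way should return an array of values per row (on newer releases, pyspark.sql.functions.explode can flatten it to one row per element). The temp table name here is hypothetical.

    # Dot notation over the array of structs from the printed schema.
    ids = df.select("campaignArray.CAMPAIGN_ID")
    ids.show()

    # Equivalent via SQL, assuming the DataFrame is registered as a temp table.
    df.registerTempTable("impressions")
    sqlContext.sql("SELECT campaignArray.CAMPAIGN_ID FROM impressions").show()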
Re: SQL UserDefinedType can't be saved in parquet file when using assembly jar
In this example, every thing work expect save to parquet file. On Mon, May 11, 2015 at 4:39 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: MyDenseVectorUDT do exist in the assembly jar and in this example all the code is in a single file to make sure every thing is included. On Tue, Apr 21, 2015 at 1:17 AM, Xiangrui Meng men...@gmail.com wrote: You should check where MyDenseVectorUDT is defined and whether it was on the classpath (or in the assembly jar) at runtime. Make sure the full class name (with package name) is used. Btw, UDTs are not public yet, so please use it with caution. -Xiangrui On Fri, Apr 17, 2015 at 12:45 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, Here is an example of code to reproduce the issue I mentioned in a previous mail about saving an UserDefinedType into a parquet file. The problem here is that the code works when I run it inside intellij idea but fails when I create the assembly jar and run it with spark-submit. I use the master version of Spark. @SQLUserDefinedType(udt = classOf[MyDenseVectorUDT]) class MyDenseVector(val data: Array[Double]) extends Serializable { override def equals(other: Any): Boolean = other match { case v: MyDenseVector = java.util.Arrays.equals(this.data, v.data) case _ = false } } class MyDenseVectorUDT extends UserDefinedType[MyDenseVector] { override def sqlType: DataType = ArrayType(DoubleType, containsNull = false) override def serialize(obj: Any): Seq[Double] = { obj match { case features: MyDenseVector = features.data.toSeq } } override def deserialize(datum: Any): MyDenseVector = { datum match { case data: Seq[_] = new MyDenseVector(data.asInstanceOf[Seq[Double]].toArray) } } override def userClass: Class[MyDenseVector] = classOf[MyDenseVector] } case class Toto(imageAnnotation: MyDenseVector) object TestUserDefinedType { case class Params(input: String = null, partitions: Int = 12, outputDir: String = images.parquet) def main(args: Array[String]): Unit = { val conf = new SparkConf().setAppName(ImportImageFolder).setMaster(local[4]) val sc = new SparkContext(conf) val sqlContext = new SQLContext(sc) import sqlContext.implicits._ val rawImages = sc.parallelize((1 to 5).map(x = Toto(new MyDenseVector(Array[Double](x.toDouble).toDF rawImages.printSchema() rawImages.show() rawImages.save(toto.parquet) // This fails with assembly jar sc.stop() } } My build.sbt is as follow : libraryDependencies ++= Seq( org.apache.spark %% spark-core % sparkVersion % provided, org.apache.spark %% spark-sql % sparkVersion, org.apache.spark %% spark-mllib % sparkVersion ) assemblyMergeStrategy in assembly := { case PathList(javax, servlet, xs @ _*) = MergeStrategy.first case PathList(org, apache, xs @ _*) = MergeStrategy.first case PathList(org, jboss, xs @ _*) = MergeStrategy.first // case PathList(ps @ _*) if ps.last endsWith .html = MergeStrategy.first // case application.conf= MergeStrategy.concat case m if m.startsWith(META-INF) = MergeStrategy.discard //case x = // val oldStrategy = (assemblyMergeStrategy in assembly).value // oldStrategy(x) case _ = MergeStrategy.first } As I said, this code works without problem when I execute it inside intellij idea. 
But when generate the assembly jar with sbt-assembly and use spark-submit I got the following error : 15/04/17 09:34:01 INFO ParquetOutputFormat: Writer version is: PARQUET_1_0 15/04/17 09:34:01 ERROR Executor: Exception in task 3.0 in stage 2.0 (TID 7) java.lang.IllegalArgumentException: Unsupported dataType: {type:struct,fields:[{name:imageAnnotation,type:{type:udt,class:MyDenseVectorUDT,pyClass:null,sqlType:{type:array,elementType:double,containsNull:false}},nullable:true,metadata:{}}]}, [1.1] failure: `TimestampType' expected but `{' found {type:struct,fields:[{name:imageAnnotation,type:{type:udt,class:MyDenseVectorUDT,pyClass:null,sqlType:{type:array,elementType:double,containsNull:false}},nullable:true,metadata:{}}]} ^ at org.apache.spark.sql.types.DataType$CaseClassStringParser$.apply(dataTypes.scala:163) at org.apache.spark.sql.types.DataType$.fromCaseClassString(dataTypes.scala:98) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$6.apply(ParquetTypes.scala:402) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$6.apply(ParquetTypes.scala:402) at scala.util.Try.getOrElse(Try.scala:77) at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromString(ParquetTypes.scala:402) at
Re: SQL UserDefinedType can't be saved in parquet file when using assembly jar
MyDenseVectorUDT do exist in the assembly jar and in this example all the code is in a single file to make sure every thing is included. On Tue, Apr 21, 2015 at 1:17 AM, Xiangrui Meng men...@gmail.com wrote: You should check where MyDenseVectorUDT is defined and whether it was on the classpath (or in the assembly jar) at runtime. Make sure the full class name (with package name) is used. Btw, UDTs are not public yet, so please use it with caution. -Xiangrui On Fri, Apr 17, 2015 at 12:45 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, Here is an example of code to reproduce the issue I mentioned in a previous mail about saving an UserDefinedType into a parquet file. The problem here is that the code works when I run it inside intellij idea but fails when I create the assembly jar and run it with spark-submit. I use the master version of Spark. @SQLUserDefinedType(udt = classOf[MyDenseVectorUDT]) class MyDenseVector(val data: Array[Double]) extends Serializable { override def equals(other: Any): Boolean = other match { case v: MyDenseVector = java.util.Arrays.equals(this.data, v.data) case _ = false } } class MyDenseVectorUDT extends UserDefinedType[MyDenseVector] { override def sqlType: DataType = ArrayType(DoubleType, containsNull = false) override def serialize(obj: Any): Seq[Double] = { obj match { case features: MyDenseVector = features.data.toSeq } } override def deserialize(datum: Any): MyDenseVector = { datum match { case data: Seq[_] = new MyDenseVector(data.asInstanceOf[Seq[Double]].toArray) } } override def userClass: Class[MyDenseVector] = classOf[MyDenseVector] } case class Toto(imageAnnotation: MyDenseVector) object TestUserDefinedType { case class Params(input: String = null, partitions: Int = 12, outputDir: String = images.parquet) def main(args: Array[String]): Unit = { val conf = new SparkConf().setAppName(ImportImageFolder).setMaster(local[4]) val sc = new SparkContext(conf) val sqlContext = new SQLContext(sc) import sqlContext.implicits._ val rawImages = sc.parallelize((1 to 5).map(x = Toto(new MyDenseVector(Array[Double](x.toDouble).toDF rawImages.printSchema() rawImages.show() rawImages.save(toto.parquet) // This fails with assembly jar sc.stop() } } My build.sbt is as follow : libraryDependencies ++= Seq( org.apache.spark %% spark-core % sparkVersion % provided, org.apache.spark %% spark-sql % sparkVersion, org.apache.spark %% spark-mllib % sparkVersion ) assemblyMergeStrategy in assembly := { case PathList(javax, servlet, xs @ _*) = MergeStrategy.first case PathList(org, apache, xs @ _*) = MergeStrategy.first case PathList(org, jboss, xs @ _*) = MergeStrategy.first // case PathList(ps @ _*) if ps.last endsWith .html = MergeStrategy.first // case application.conf= MergeStrategy.concat case m if m.startsWith(META-INF) = MergeStrategy.discard //case x = // val oldStrategy = (assemblyMergeStrategy in assembly).value // oldStrategy(x) case _ = MergeStrategy.first } As I said, this code works without problem when I execute it inside intellij idea. 
But when generate the assembly jar with sbt-assembly and use spark-submit I got the following error : 15/04/17 09:34:01 INFO ParquetOutputFormat: Writer version is: PARQUET_1_0 15/04/17 09:34:01 ERROR Executor: Exception in task 3.0 in stage 2.0 (TID 7) java.lang.IllegalArgumentException: Unsupported dataType: {type:struct,fields:[{name:imageAnnotation,type:{type:udt,class:MyDenseVectorUDT,pyClass:null,sqlType:{type:array,elementType:double,containsNull:false}},nullable:true,metadata:{}}]}, [1.1] failure: `TimestampType' expected but `{' found {type:struct,fields:[{name:imageAnnotation,type:{type:udt,class:MyDenseVectorUDT,pyClass:null,sqlType:{type:array,elementType:double,containsNull:false}},nullable:true,metadata:{}}]} ^ at org.apache.spark.sql.types.DataType$CaseClassStringParser$.apply(dataTypes.scala:163) at org.apache.spark.sql.types.DataType$.fromCaseClassString(dataTypes.scala:98) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$6.apply(ParquetTypes.scala:402) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$6.apply(ParquetTypes.scala:402) at scala.util.Try.getOrElse(Try.scala:77) at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromString(ParquetTypes.scala:402) at org.apache.spark.sql.parquet.RowWriteSupport.init(ParquetTableSupport.scala:145) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:278) at
Re: can we start a new thread in foreachRDD in spark streaming?
It depends on how you want to run your application. You can always save 100 batch as a data file and run another app to read those files. In that case you have separated contexts and you will find both application running simultaneously in the cluster but on different JVMs. But if you do not want to use separate process you can use the same context and then training tasks will run on same JVM as the streaming. Basically in first option you are using batch and real time pipeline of lambda architecture whereas in second option you are doing everything in real time pipeline. Best Ayan On 12 May 2015 00:08, hotdog lisend...@163.com wrote: I want to start a child-thread in foreachRDD. My situation is: the job is reading from a hdfs dir continuously, and every 100 batches, I want to launch a model training task (I will make a snapshot of the rdds at that time and start the training task. the training task takes a very long time(2 hours), and I don't want the training task influence reading new batch of data. Is starting a new child thread a good solution? Could the child thread use SparkContext in the main thread and use the rdd in main thread? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/can-we-start-a-new-thread-in-foreachRDD-in-spark-streaming-tp22845.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
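A rough PySpark sketch of the "same context" variant being discussed: count batches in the driver inside foreachRDD, snapshot every 100th batch to HDFS, and hand the long training off to a background thread so the streaming loop keeps consuming new files. The function and path names here are hypothetical.

    import threading
    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, 60)
    lines = ssc.textFileStream("hdfs:///user/root/")

    batch_count = [0]   # mutable holder so the closure can update it

    def train_in_background(path):
        # Long-running model training (placeholder); runs on its own thread
        # so it does not block the driver's streaming loop.
        data = sc.textFile(path)
        run_training(data)          # hypothetical training routine

    def handle(rdd):
        batch_count[0] += 1
        if batch_count[0] % 100 == 0 and not rdd.isEmpty():
            snapshot = "hdfs:///tmp/snapshots/batch-%d" % batch_count[0]
            rdd.saveAsTextFile(snapshot)
            threading.Thread(target=train_in_background, args=(snapshot,)).start()

    lines.foreachRDD(handle)
    ssc.start()
    ssc.awaitTermination()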
Re: EVent generation
I've had good success with splunk generator. https://github.com/coccyx/eventgen/blob/master/README.md On May 11, 2015, at 00:05, Akhil Das ak...@sigmoidanalytics.commailto:ak...@sigmoidanalytics.com wrote: Have a look over here https://storm.apache.org/community.html Thanks Best Regards On Sun, May 10, 2015 at 3:21 PM, anshu shukla anshushuk...@gmail.commailto:anshushuk...@gmail.com wrote: http://stackoverflow.com/questions/30149868/generate-events-tuples-using-csv-file-with-timestamps -- Thanks Regards, Anshu Shukla
Stratified sampling with DataFrames
Hi, I'm on Spark 1.3.0 and my data is in DataFrames. I need operations like sampleByKey() and sampleByKeyExact(). I saw the JIRA "Add approximate stratified sampling to DataFrame" (https://issues.apache.org/jira/browse/SPARK-7157). That's targeted for Spark 1.5; until that comes through, what's the easiest way to accomplish the equivalent of sampleByKey() and sampleByKeyExact() on DataFrames? Thanks Regards MK
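Until SPARK-7157 lands, one workaround is to drop to the underlying RDD, use RDD.sampleByKey, and rebuild the DataFrame. A hedged sketch follows: the column name and fractions are placeholders, and sampleByKeyExact may not be available from Python, so this only approximates the exact variant.

    # Approximate stratified sample keyed by a column, then back to a DataFrame.
    fractions = {"a": 0.1, "b": 0.5}          # per-key sampling fractions (example)
    pair_rdd = df.rdd.map(lambda row: (row.category, row))   # "category" is a placeholder
    sampled = pair_rdd.sampleByKey(False, fractions, seed=42).values()
    sampled_df = sqlContext.createDataFrame(sampled)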
Met a problem when using spark to load parquet files with different version schemas
Hi, devs, I met a problem when using spark to read to parquet files with two different versions of schemas. For example, the first file has one field with int type, while the same field in the second file is a long. I thought spark would automatically generate a merged schema long, and use that schema to process both files. However, the following code cannot work: DataFrame df = sqlContext.parquetFile(inputPath); df.registerTempTable(data); sqlContext.sql(select count(msg.actual_eta) from data).collect(); Exception: parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file f1.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) BTW, I use spark 1.3.1, and already set spark.sql.parquet.useDataSourceApi to false. Any help would be appreciated. -Wei
Looking inside the 'mapPartitions' transformation, some confusing observations
As we all know, a partition in Spark is actually an Iterator[T]. For some purposes, I want to treat each partition not as an Iterator but as one whole object. For example, turn an Iterator[Int] into a breeze.linalg.DenseVector[Int]. So I use the 'mapPartitions' API to achieve this; however, during the implementation I made some confusing observations. I use Spark 1.3.0 on a cluster with 10 executor nodes. Below are the different attempts:

import breeze.linalg.DenseVector

val a = sc.parallelize(1 to 100, 10)

val b = a.mapPartitions(iter => {
  val v = Array.ofDim[Int](iter.size)
  var ind = 0
  while (iter.hasNext) {
    v(ind) = iter.next
    ind += 1
  }
  println(v.mkString(","))
  Iterator.single[DenseVector[Int]](DenseVector(v))
})
b.count()

val c = a.mapPartitions(iter => {
  val v = Array.ofDim[Int](iter.size)
  iter.copyToArray(v, 0, 10)
  println(v.mkString(","))
  Iterator.single[DenseVector[Int]](DenseVector(v))
})
c.count()

val d = a.mapPartitions(iter => {
  val v = iter.toArray
  println(v.mkString(","))
  Iterator.single[DenseVector[Int]](DenseVector(v))
})
d.count()

I can see the printed output in the executors' stdout. Only attempt 'd' satisfies my needs; the other attempts only produce a zero DenseVector, which means the assignment from iterator to vector did not happen. I would appreciate explanations for these observations. Thanks.

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Looking-inside-the-mapPartitions-transformation-some-confused-observations-tp22850.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
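A likely explanation for the observations above: Scala's Iterator.size traverses (and therefore exhausts) the iterator, so in attempts 'b' and 'c' there is nothing left to read by the time the while loop or copyToArray runs, while 'd' consumes the iterator exactly once via toArray and works. The same single-pass constraint applies to PySpark partitions; a small sketch:

    # In PySpark the partition handed to mapPartitions is also a one-shot
    # iterator: measuring its length consumes it, so materialize it once.
    def to_list_once(it):
        values = list(it)        # single traversal; safe
        yield values

    def to_list_broken(it):
        n = sum(1 for _ in it)   # this exhausts the iterator...
        yield [x for x in it]    # ...so this is always empty

    a = sc.parallelize(range(1, 101), 10)
    print a.mapPartitions(to_list_once).first()
    print a.mapPartitions(to_list_broken).first()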
Re: Getting error running MLlib example with new cluster
Got it to work on the cluster by changing the master to yarn-cluster instead of local! I do have a couple follow up questions... This is the example I was trying to run:https://github.com/holdenk/learning-spark-examples/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala 1) The example still takes about 1 min 15 seconds to run (my cluster has 3 m3.large nodes). This seems really long for building a model based off data that is about 10 lines long. Is this normal? 2) Any guesses as to why it was able to run in the cluster, but not locally? Thanks for the help! On Mon, Apr 27, 2015 at 11:48 AM, Su She suhsheka...@gmail.com wrote: Hello Xiangrui, I am using this spark-submit command (as I do for all other jobs): /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/bin/spark-submit --class MLlib --master local[2] --jars $(echo /home/ec2-user/sparkApps/learning-spark/lib/*.jar | tr ' ' ',') /home/ec2-user/sparkApps/learning-spark/target/simple-project-1.1.jar Thank you for the help! Best, Su On Mon, Apr 27, 2015 at 9:58 AM, Xiangrui Meng men...@gmail.com wrote: How did you run the example app? Did you use spark-submit? -Xiangrui On Thu, Apr 23, 2015 at 2:27 PM, Su She suhsheka...@gmail.com wrote: Sorry, accidentally sent the last email before finishing. I had asked this question before, but wanted to ask again as I think it is now related to my pom file or project setup. Really appreciate the help! I have been trying on/off for the past month to try to run this MLlib example: https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala I am able to build the project successfully. When I run it, it returns: features in spam: 8 features in ham: 7 and then freezes. According to the UI, the description of the job is count at DataValidators.scala.38. This corresponds to this line in the code: val model = lrLearner.run(trainingData) I've tried just about everything I can think of...changed numFeatures from 1 - 10,000, set executor memory to 1g, set up a new cluster, at this point I think I might have missed dependencies as that has usually been the problem in other spark apps I have tried to run. This is my pom file, that I have used for other successful spark apps. Please let me know if you think I need any additional dependencies or there are incompatibility issues, or a pom.xml that is better to use. Thank you! 
Cluster information: Spark version: 1.2.0-SNAPSHOT (in my older cluster it is 1.2.0) java version 1.7.0_25 Scala version: 2.10.4 hadoop version: hadoop 2.5.0-cdh5.3.3 (older cluster was 5.3.0) project xmlns = http://maven.apache.org/POM/4.0.0; xmlns:xsi=http://w3.org/2001/XMLSchema-instance; xsi:schemaLocation =http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd; groupId edu.berkely/groupId artifactId simple-project /artifactId modelVersion 4.0.0/modelVersion name Simple Project /name packaging jar /packaging version 1.0 /version repositories repository idcloudera/id url http://repository.cloudera.com/artifactory/cloudera-repos//url /repository repository idscala-tools.org/id nameScala-tools Maven2 Repository/name urlhttp://scala-tools.org/repo-releases/url /repository /repositories pluginRepositories pluginRepository idscala-tools.org/id nameScala-tools Maven2 Repository/name urlhttp://scala-tools.org/repo-releases/url /pluginRepository /pluginRepositories build plugins plugin groupIdorg.scala-tools/groupId artifactIdmaven-scala-plugin/artifactId executions execution idcompile/id goals goalcompile/goal /goals phasecompile/phase /execution execution idtest-compile/id goals goaltestCompile/goal /goals phasetest-compile/phase /execution execution phaseprocess-resources/phase goals goalcompile/goal /goals /execution /executions /plugin plugin
Re: JAVA for SPARK certification
Note that O'Reilly Media has test prep materials in development. The exam does include questions in Scala, Python, Java, and SQL -- and frankly a number of the questions are about comparing or identifying equivalent Spark techniques between two of those different languages. The questions do not go into much detail for any of the languages Scala, Python, Java -- emphasis is on using Spark, not language nuances. Also, there is no coding required: the questions typically to have several code blocks and you select among them to identify the best answer. Overall the exam is targeted at an intermediate Spark user, with background as an application developer in either Python, Scala, or Java, and some familiarity with Big Data and distributed systems. Those were the requirements. Five different practice areas assess the following: 1. Understanding breadth of Spark API usage across Scala, Java, Python 2. Applying best practices to avoid runtime issues and performance bottlenecks 3. Distinguishing Spark features and practices from MapReduce usage 4. Integrating SQL, Streaming, ML, Graph atop the Spark unified engine 5. Solving typical use cases with Spark in Scala, Java, Python Understanding how Spark operates in a production environment is a major emphasis of the exam. Understanding about typical kinds of serialization exceptions is also important. If you look through earlier talks about Spark best practices from Matei Zaharia, Patrick Wendell, Reynold Xin, et al., those are great for prep. BTW, the Kryterion platform allows a person taking the test to mark questions for revisiting later. So if a problem seems difficult or non-intuitive, just mark it and move on the to the next one. Then come back to the marked ones later. Some questions are easier, some are harder, but the order is always randomized. There's a 90 minute limit; however, most people who pass finish within 60 minutes. So if you use mark for later review you'll likely have time at the end to revisit the harder questions. On Tue, May 5, 2015 at 7:05 AM, Gourav Sengupta gourav.sengu...@gmail.com wrote: Hi, I think all the required materials for reference are mentioned here: http://www.oreilly.com/data/sparkcert.html?cmp=ex-strata-na-lp-na_apache_spark_certification My question was regarding the proficiency level required for Java. There are detailed examples and code mentioned for JAVA, Python and Scala in most of the SCALA tutorials mentioned in the above link for reference. Regards, Gourav On Tue, May 5, 2015 at 3:03 PM, ayan guha guha.a...@gmail.com wrote: Very interested @Kartik/Zoltan. Please let me know how to connect on LI On Tue, May 5, 2015 at 11:47 PM, Zoltán Zvara zoltan.zv...@gmail.com wrote: I might join in to this conversation with an ask. Would someone point me to a decent exercise that would approximate the level of this exam (from above)? Thanks! On Tue, May 5, 2015 at 3:37 PM Kartik Mehta kartik.meht...@gmail.com wrote: Production - not whole lot of companies have implemented Spark in production and so though it is good to have, not must. If you are on LinkedIn, a group of folks including myself are preparing for Spark certification, learning in group makes learning easy and fun. Kartik On May 5, 2015 7:31 AM, ayan guha guha.a...@gmail.com wrote: And how important is to have production environment? On 5 May 2015 20:51, Stephen Boesch java...@gmail.com wrote: There are questions in all three languages. 2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com: I too have similar question. 
My understanding is since Spark written in scala, having done in Scala will be ok for certification. If someone who has done certification can confirm. Thanks, Kartik On May 5, 2015 5:57 AM, Gourav Sengupta gourav.sengu...@gmail.com wrote: Hi, how important is JAVA for Spark certification? Will learning only Python and Scala not work? Regards, Gourav -- Best Regards, Ayan Guha
Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge
You can use '--jars ' option of spark-submit to ship metrics-core jar. Cheers On Mon, May 11, 2015 at 2:04 PM, Lee McFadden splee...@gmail.com wrote: Thanks Ted, The issue is that I'm using packages (see spark-submit definition) and I do not know how to add com.yammer.metrics:metrics-core to my classpath so Spark can see it. Should metrics-core not be part of the org.apache.spark:spark-streaming-kafka_2.10:1.3.1 package so it can work correctly? If not, any clues as to how I can add metrics-core to my project (bearing in mind that I'm using Python, not a JVM language) would be much appreciated. Thanks, and apologies for my newbness with Java/Scala. On Mon, May 11, 2015 at 1:42 PM Ted Yu yuzhih...@gmail.com wrote: com.yammer.metrics.core.Gauge is in metrics-core jar e.g., in master branch: [INFO] | \- org.apache.kafka:kafka_2.10:jar:0.8.1.1:compile [INFO] | +- com.yammer.metrics:metrics-core:jar:2.2.0:compile Please make sure metrics-core jar is on the classpath. On Mon, May 11, 2015 at 1:32 PM, Lee McFadden splee...@gmail.com wrote: Hi, We've been having some issues getting spark streaming running correctly using a Kafka stream, and we've been going around in circles trying to resolve this dependency. Details of our environment and the error below, if anyone can help resolve this it would be much appreciated. Submit command line: /home/ubuntu/spark/spark-1.3.1/bin/spark-submit \ --packages TargetHolding/pyspark-cassandra:0.1.4,org.apache.spark:spark-streaming-kafka_2.10:1.3.1 \ --conf spark.cassandra.connection.host=10.10.103.172,10.10.102.160,10.10.101.79 \ --master spark://127.0.0.1:7077 \ affected_hosts.py When we run the streaming job everything starts just fine, then we see the following in the logs: 15/05/11 19:50:46 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 70, ip-10-10-102-53.us-west-2.compute.internal): java.lang.NoClassDefFoundError: com/yammer/metrics/core/Gauge at kafka.consumer.ZookeeperConsumerConnector.createFetcher(ZookeeperConsumerConnector.scala:151) at kafka.consumer.ZookeeperConsumerConnector.init(ZookeeperConsumerConnector.scala:115) at kafka.consumer.ZookeeperConsumerConnector.init(ZookeeperConsumerConnector.scala:128) at kafka.consumer.Consumer$.create(ConsumerConnector.scala:89) at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100) at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121) at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:298) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:290) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassNotFoundException: com.yammer.metrics.core.Gauge at java.net.URLClassLoader$1.run(URLClassLoader.java:372) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at 
java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:360) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 17 more
Re: JAVA for SPARK certification
Awesome!! Thank you Mr. Nathan, Great to have a guide like you and helping us all, Regards, Kartik On May 11, 2015 5:07 PM, Paco Nathan cet...@gmail.com wrote: Note that O'Reilly Media has test prep materials in development. The exam does include questions in Scala, Python, Java, and SQL -- and frankly a number of the questions are about comparing or identifying equivalent Spark techniques between two of those different languages. The questions do not go into much detail for any of the languages Scala, Python, Java -- emphasis is on using Spark, not language nuances. Also, there is no coding required: the questions typically to have several code blocks and you select among them to identify the best answer. Overall the exam is targeted at an intermediate Spark user, with background as an application developer in either Python, Scala, or Java, and some familiarity with Big Data and distributed systems. Those were the requirements. Five different practice areas assess the following: 1. Understanding breadth of Spark API usage across Scala, Java, Python 2. Applying best practices to avoid runtime issues and performance bottlenecks 3. Distinguishing Spark features and practices from MapReduce usage 4. Integrating SQL, Streaming, ML, Graph atop the Spark unified engine 5. Solving typical use cases with Spark in Scala, Java, Python Understanding how Spark operates in a production environment is a major emphasis of the exam. Understanding about typical kinds of serialization exceptions is also important. If you look through earlier talks about Spark best practices from Matei Zaharia, Patrick Wendell, Reynold Xin, et al., those are great for prep. BTW, the Kryterion platform allows a person taking the test to mark questions for revisiting later. So if a problem seems difficult or non-intuitive, just mark it and move on the to the next one. Then come back to the marked ones later. Some questions are easier, some are harder, but the order is always randomized. There's a 90 minute limit; however, most people who pass finish within 60 minutes. So if you use mark for later review you'll likely have time at the end to revisit the harder questions. On Tue, May 5, 2015 at 7:05 AM, Gourav Sengupta gourav.sengu...@gmail.com wrote: Hi, I think all the required materials for reference are mentioned here: http://www.oreilly.com/data/sparkcert.html?cmp=ex-strata-na-lp-na_apache_spark_certification My question was regarding the proficiency level required for Java. There are detailed examples and code mentioned for JAVA, Python and Scala in most of the SCALA tutorials mentioned in the above link for reference. Regards, Gourav On Tue, May 5, 2015 at 3:03 PM, ayan guha guha.a...@gmail.com wrote: Very interested @Kartik/Zoltan. Please let me know how to connect on LI On Tue, May 5, 2015 at 11:47 PM, Zoltán Zvara zoltan.zv...@gmail.com wrote: I might join in to this conversation with an ask. Would someone point me to a decent exercise that would approximate the level of this exam (from above)? Thanks! On Tue, May 5, 2015 at 3:37 PM Kartik Mehta kartik.meht...@gmail.com wrote: Production - not whole lot of companies have implemented Spark in production and so though it is good to have, not must. If you are on LinkedIn, a group of folks including myself are preparing for Spark certification, learning in group makes learning easy and fun. Kartik On May 5, 2015 7:31 AM, ayan guha guha.a...@gmail.com wrote: And how important is to have production environment? 
On 5 May 2015 20:51, Stephen Boesch java...@gmail.com wrote: There are questions in all three languages. 2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com: I too have similar question. My understanding is since Spark written in scala, having done in Scala will be ok for certification. If someone who has done certification can confirm. Thanks, Kartik On May 5, 2015 5:57 AM, Gourav Sengupta gourav.sengu...@gmail.com wrote: Hi, how important is JAVA for Spark certification? Will learning only Python and Scala not work? Regards, Gourav -- Best Regards, Ayan Guha
Re: Running Spark in local mode seems to ignore local[N]
Sean, How does this model actually work? Let's say we want to run one job as N threads executing one particular task, e.g. streaming data out of Kafka into a search engine. How do we configure our Spark job execution? Right now, I'm seeing this job running as a single thread. And it's quite a bit slower than just running a simple utility with a thread executor with a thread pool of N threads doing the same task. The performance I'm seeing of running the Kafka-Spark Streaming job is 7 times slower than that of the utility. What's pulling Spark back? Thanks. On Mon, May 11, 2015 at 4:55 PM, Sean Owen so...@cloudera.com wrote: You have one worker with one executor with 32 execution slots. On Mon, May 11, 2015 at 9:52 PM, dgoldenberg dgoldenberg...@gmail.com wrote: Hi, Is there anything special one must do, running locally and submitting a job like so: spark-submit \ --class com.myco.Driver \ --master local[*] \ ./lib/myco.jar In my logs, I'm only seeing log messages with the thread identifier of Executor task launch worker-0. There are 4 cores on the machine so I expected 4 threads to be at play. Running with local[32] did not yield 32 worker threads. Any recommendations? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-in-local-mode-seems-to-ignore-local-N-tp22851.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Running Spark in local mode seems to ignore local[N]
BTW I think my comment was wrong as marcelo demonstrated. In standalone mode you'd have one worker, and you do have one executor, but his explanation is right. But, you certainly have execution slots for each core. Are you talking about your own user code? you can make threads, but that's nothing do with Spark then. If you run code on your driver, it's not distributed. If you run Spark over an RDD with 1 partition, only one task works on it. On Mon, May 11, 2015 at 10:16 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Sean, How does this model actually work? Let's say we want to run one job as N threads executing one particular task, e.g. streaming data out of Kafka into a search engine. How do we configure our Spark job execution? Right now, I'm seeing this job running as a single thread. And it's quite a bit slower than just running a simple utility with a thread executor with a thread pool of N threads doing the same task. The performance I'm seeing of running the Kafka-Spark Streaming job is 7 times slower than that of the utility. What's pulling Spark back? Thanks. On Mon, May 11, 2015 at 4:55 PM, Sean Owen so...@cloudera.com wrote: You have one worker with one executor with 32 execution slots. On Mon, May 11, 2015 at 9:52 PM, dgoldenberg dgoldenberg...@gmail.com wrote: Hi, Is there anything special one must do, running locally and submitting a job like so: spark-submit \ --class com.myco.Driver \ --master local[*] \ ./lib/myco.jar In my logs, I'm only seeing log messages with the thread identifier of Executor task launch worker-0. There are 4 cores on the machine so I expected 4 threads to be at play. Running with local[32] did not yield 32 worker threads. Any recommendations? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-in-local-mode-seems-to-ignore-local-N-tp22851.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Is the AMP lab done next February?
Relaying an answer from AMP director Mike Franklin: One year into the lab we got a 5 yr Expeditions in Computing Award as part of the White House Big Data initiative in 2012, so we extend the lab for a year. We intend to start winding it down at the end of 2016, while supporting existing projects and students who will be finishing up. The AMPLab faculty are starting discussions this summer about what research challenges we'd like to tackle next, and how best to organize to do so. An interesting thing to note is that the Spark project started at about this point in the AMPLab predecessor project (RADLab) so we have a track record of being able to make these transitions. On Sat, May 9, 2015 at 8:43 PM, Justin Pihony justin.pih...@gmail.com wrote: From my StackOverflow question https://stackoverflow.com/questions/29593139/is-the-amp-lab-done-next-february : Is there a way to track whether Berkeley's AMP lab will indeed shutdown next year? From their about site: The AMPLab is a five-year collaborative effort at UC Berkeley and it was started in February 2011. So, I was curious if this was a hard date, or if it will be extended (or has already been extended?) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-the-AMP-lab-done-next-February-tp22832.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Running Spark in local mode seems to ignore local[N]
Thanks, Sean. This was not yet digested data for me :) The number of partitions in a streaming RDD is determined by the block interval and the batch interval. I have seen the bit on spark.streaming.blockInterval in the doc but I didn't connect it with the batch interval and the number of partitions. On Mon, May 11, 2015 at 5:34 PM, Sean Owen so...@cloudera.com wrote: You might have a look at the Spark docs to start. 1 batch = 1 RDD, but 1 RDD can have many partitions. And should, for scale. You do not submit multiple jobs to get parallelism. The number of partitions in a streaming RDD is determined by the block interval and the batch interval. If you have a batch interval of 10s and block interval of 1s you'll get 10 partitions of data in the RDD. On Mon, May 11, 2015 at 10:29 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Understood. We'll use the multi-threaded code we already have.. How are these execution slots filled up? I assume each slot is dedicated to one submitted task. If that's the case, how is each task distributed then, i.e. how is that task run in a multi-node fashion? Say 1000 batches/RDD's are extracted out of Kafka, how does that relate to the number of executors vs. task slots? Presumably we can fill up the slots with multiple instances of the same task... How do we know how many to launch? On Mon, May 11, 2015 at 5:20 PM, Sean Owen so...@cloudera.com wrote: BTW I think my comment was wrong as marcelo demonstrated. In standalone mode you'd have one worker, and you do have one executor, but his explanation is right. But, you certainly have execution slots for each core. Are you talking about your own user code? you can make threads, but that's nothing do with Spark then. If you run code on your driver, it's not distributed. If you run Spark over an RDD with 1 partition, only one task works on it. On Mon, May 11, 2015 at 10:16 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Sean, How does this model actually work? Let's say we want to run one job as N threads executing one particular task, e.g. streaming data out of Kafka into a search engine. How do we configure our Spark job execution? Right now, I'm seeing this job running as a single thread. And it's quite a bit slower than just running a simple utility with a thread executor with a thread pool of N threads doing the same task. The performance I'm seeing of running the Kafka-Spark Streaming job is 7 times slower than that of the utility. What's pulling Spark back? Thanks. On Mon, May 11, 2015 at 4:55 PM, Sean Owen so...@cloudera.com wrote: You have one worker with one executor with 32 execution slots. On Mon, May 11, 2015 at 9:52 PM, dgoldenberg dgoldenberg...@gmail.com wrote: Hi, Is there anything special one must do, running locally and submitting a job like so: spark-submit \ --class com.myco.Driver \ --master local[*] \ ./lib/myco.jar In my logs, I'm only seeing log messages with the thread identifier of Executor task launch worker-0. There are 4 cores on the machine so I expected 4 threads to be at play. Running with local[32] did not yield 32 worker threads. Any recommendations? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-in-local-mode-seems-to-ignore-local-N-tp22851.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
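[Editor's note] For reference, a minimal sketch of the relationship described above on Spark 1.3, where a 10s batch interval with a 1s block interval yields roughly 10 partitions per batch RDD. The application name and interval values are illustrative, not from the thread.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("KafkaIngest")
  .set("spark.streaming.blockInterval", "1000") // block interval in milliseconds on this version: one block per second
val ssc = new StreamingContext(conf, Seconds(10)) // batch interval: 10 seconds
// With a receiver-based stream (e.g. KafkaUtils.createStream), each 10s batch RDD then
// contains roughly 10 blocks, i.e. ~10 partitions; stream.repartition(n) can spread the
// work further across executors if needed.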
Re: Reading Nested Fields in DataFrames
Since there is an array here you are probably looking for HiveQL's LATERAL VIEW explode https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView . On Mon, May 11, 2015 at 7:12 AM, ayan guha guha.a...@gmail.com wrote: Typically you would use . notation to access, same way you would access a map. On 12 May 2015 00:06, Ashish Kumar Singh ashish23...@gmail.com wrote: Hi , I am trying to read Nested Avro data in Spark 1.3 using DataFrames. I need help to retrieve the Inner element data in the Structure below. Below is the schema when I enter df.printSchema : |-- THROTTLING_PERCENTAGE: double (nullable = false) |-- IMPRESSION_TYPE: string (nullable = false) |-- campaignArray: array (nullable = false) ||-- element: struct (containsNull = false) |||-- COOKIE: string (nullable = false) |||-- CAMPAIGN_ID: long (nullable = false) How can I access CAMPAIGN_ID field in this schema ? Thanks, Ashish Kr. Singh
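[Editor's note] For reference, a minimal sketch of the LATERAL VIEW approach suggested above, assuming the DataFrame has been registered as a temp table named "events" and a HiveContext is in use; the table alias names are illustrative.

df.registerTempTable("events")
val campaigns = sqlContext.sql(
  """SELECT c.COOKIE, c.CAMPAIGN_ID
    |FROM events
    |LATERAL VIEW explode(campaignArray) t AS c""".stripMargin)
campaigns.show()
// Dot notation, as suggested above, should also work for a single nested field:
// df.select("campaignArray.CAMPAIGN_ID") returns the array of IDs per row rather than
// one row per array element.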
Re: Met a problem when using spark to load parquet files with different version schemas
BTW, I use spark 1.3.1, and already set spark.sql.parquet.useDataSourceApi to false. Schema merging is only supported when this flag is set to true (setting it to false uses old code that will be removed once the new code is proven).
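[Editor's note] For reference, a minimal sketch of turning the flag back on before loading, which is what enables schema merging; the path is hypothetical.

sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "true")
val df = sqlContext.parquetFile("/data/mixed_version_parquet")
df.printSchema() // merged schema across the different file versions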
Re: Get a list of temporary RDD tables via Thrift
Temporary tables are not displayed by SHOW TABLES until Spark 1.3. On Mon, May 11, 2015 at 12:54 PM, Judy Nash judyn...@exchange.microsoft.com wrote: Hi, How can I get a list of temporary tables via Thrift? Have used thrift’s startWithContext and registered a temp table, but not seeing the temp table/rdd when running “show tables”. Thanks, Judy
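[Editor's note] For reference, a minimal sketch of exposing a temp table through the Thrift server from the same context on Spark 1.3+, assuming a HiveContext; the path and table name are hypothetical.

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(sc)
val df = hiveContext.parquetFile("/data/some_table")
df.registerTempTable("my_temp_table")
HiveThriftServer2.startWithContext(hiveContext)
// A JDBC client connected to this server should now see my_temp_table via SHOW TABLES.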
Re: Looking inside the 'mapPartitions' transformation, some confused observations
I believe the issue in attempts b and c is that you call iter.size, which consumes (flushes) the iterator, so the subsequent attempt to copy it into a vector yields 0 items. You could use an ArrayBuilder, for example, and not need to rely on knowing the size of the iterator. On Mon, May 11, 2015 at 2:26 PM, myasuka myas...@live.com wrote: As we all know, a partition in Spark is actually an Iterator[T]. For some purposes, I want to treat each partition not as an Iterator but as one whole object, for example converting an Iterator[Int] into a breeze.linalg.DenseVector[Int]. I use the 'mapPartitions' API to achieve this; however, during the implementation I made some confusing observations. I use Spark 1.3.0 on a 10-executor-node cluster. Below are my different attempts:

import breeze.linalg.DenseVector
val a = sc.parallelize(1 to 100, 10)

val b = a.mapPartitions(iter => {
  val v = Array.ofDim[Int](iter.size)
  var ind = 0
  while (iter.hasNext) {
    v(ind) = iter.next
    ind += 1
  }
  println(v.mkString(","))
  Iterator.single[DenseVector[Int]](DenseVector(v))
})
b.count()

val c = a.mapPartitions(iter => {
  val v = Array.ofDim[Int](iter.size)
  iter.copyToArray(v, 0, 10)
  println(v.mkString(","))
  Iterator.single[DenseVector[Int]](DenseVector(v))
})
c.count()

val d = a.mapPartitions(iter => {
  val v = iter.toArray
  println(v.mkString(","))
  Iterator.single[DenseVector[Int]](DenseVector(v))
})
d.count()

I can see the printed output in the executor's stdout. Only attempt 'd' satisfies my needs; the other attempts only get a zero DenseVector, which means the data was never copied from the iterator into the vector. Hope for explanations for these observations. Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Looking-inside-the-mapPartitions-transformation-some-confused-observations-tp22850.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
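[Editor's note] For reference, a minimal sketch of the ArrayBuilder suggestion, reusing the RDD `a` from the code above: the array is built while traversing the iterator once, so its size never needs to be known up front.

import breeze.linalg.DenseVector
import scala.collection.mutable.ArrayBuilder

val e = a.mapPartitions { iter =>
  val builder = ArrayBuilder.make[Int]()
  while (iter.hasNext) builder += iter.next()
  Iterator.single(DenseVector(builder.result()))
}
e.count()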
Re: Getting error running MLlib example with new cluster
That is mostly the YARN overhead. You're starting up a container for the AM and executors, at least. That still sounds pretty slow, but the defaults aren't tuned for fast startup. On May 11, 2015 7:00 PM, Su She suhsheka...@gmail.com wrote: Got it to work on the cluster by changing the master to yarn-cluster instead of local! I do have a couple follow up questions... This is the example I was trying to run: https://github.com/holdenk/learning-spark-examples/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala 1) The example still takes about 1 min 15 seconds to run (my cluster has 3 m3.large nodes). This seems really long for building a model based off data that is about 10 lines long. Is this normal? 2) Any guesses as to why it was able to run in the cluster, but not locally? Thanks for the help! On Mon, Apr 27, 2015 at 11:48 AM, Su She suhsheka...@gmail.com wrote: Hello Xiangrui, I am using this spark-submit command (as I do for all other jobs): /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/bin/spark-submit --class MLlib --master local[2] --jars $(echo /home/ec2-user/sparkApps/learning-spark/lib/*.jar | tr ' ' ',') /home/ec2-user/sparkApps/learning-spark/target/simple-project-1.1.jar Thank you for the help! Best, Su On Mon, Apr 27, 2015 at 9:58 AM, Xiangrui Meng men...@gmail.com wrote: How did you run the example app? Did you use spark-submit? -Xiangrui On Thu, Apr 23, 2015 at 2:27 PM, Su She suhsheka...@gmail.com wrote: Sorry, accidentally sent the last email before finishing. I had asked this question before, but wanted to ask again as I think it is now related to my pom file or project setup. Really appreciate the help! I have been trying on/off for the past month to try to run this MLlib example: https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala I am able to build the project successfully. When I run it, it returns: features in spam: 8 features in ham: 7 and then freezes. According to the UI, the description of the job is count at DataValidators.scala.38. This corresponds to this line in the code: val model = lrLearner.run(trainingData) I've tried just about everything I can think of...changed numFeatures from 1 - 10,000, set executor memory to 1g, set up a new cluster, at this point I think I might have missed dependencies as that has usually been the problem in other spark apps I have tried to run. This is my pom file, that I have used for other successful spark apps. Please let me know if you think I need any additional dependencies or there are incompatibility issues, or a pom.xml that is better to use. Thank you! 
Cluster information: Spark version 1.2.0-SNAPSHOT (on my older cluster it is 1.2.0), Java version 1.7.0_25, Scala version 2.10.4, Hadoop version 2.5.0-cdh5.3.3 (older cluster was 5.3.0).

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <groupId>edu.berkely</groupId>
  <artifactId>simple-project</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <repositories>
    <repository>
      <id>cloudera</id>
      <url>http://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>
  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>
  <build>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <id>compile</id>
            <goals>
              <goal>compile</goal>
            </goals>
            <phase>compile</phase>
          </execution>
          <execution>
            <id>test-compile</id>
            <goals>
              <goal>testCompile</goal>
            </goals>
            <phase>test-compile</phase>
Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue
Looks like it is spending a lot of time doing hash probing. It could be a number of the following: 1. hash probing itself is inherently expensive compared with rest of your workload 2. murmur3 doesn't work well with this key distribution 3. quadratic probing (triangular sequence) with a power-of-2 hash table works really badly for this workload. One way to test this is to instrument changeValue function to store the number of probes in total, and then log it. We added this probing capability to the new Bytes2Bytes hash map we built. We should consider just having it being reported as some built-in metrics to facilitate debugging. https://github.com/apache/spark/blob/b83091ae4589feea78b056827bc3b7659d271e41/unsafe/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java#L214 On Mon, May 11, 2015 at 4:21 AM, Michal Haris michal.ha...@visualdna.com wrote: This is the stack trace of the worker thread: org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:150) org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:130) org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:60) org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) org.apache.spark.rdd.RDD.iterator(RDD.scala:244) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) org.apache.spark.rdd.RDD.iterator(RDD.scala:244) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) org.apache.spark.rdd.RDD.iterator(RDD.scala:244) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70) org.apache.spark.rdd.RDD.iterator(RDD.scala:242) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) org.apache.spark.scheduler.Task.run(Task.scala:64) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) On 8 May 2015 at 22:12, Josh Rosen rosenvi...@gmail.com wrote: Do you have any more specific profiling data that you can share? I'm curious to know where AppendOnlyMap.changeValue is being called from. On Fri, May 8, 2015 at 1:26 PM, Michal Haris michal.ha...@visualdna.com wrote: +dev On 6 May 2015 10:45, Michal Haris michal.ha...@visualdna.com wrote: Just wanted to check if somebody has seen similar behaviour or knows what we might be doing wrong. We have a relatively complex spark application which processes half a terabyte of data at various stages. We have profiled it in several ways and everything seems to point to one place where 90% of the time is spent: AppendOnlyMap.changeValue. The job scales and is relatively faster than its map-reduce alternative but it still feels slower than it should be. I am suspecting too much spill but I haven't seen any improvement by increasing number of partitions to 10k. Any idea would be appreciated. 
-- Michal Haris Technical Architect direct line: +44 (0) 207 749 0229 www.visualdna.com | t: +44 (0) 207 734 7033
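[Editor's note] For illustration only (this is not Spark's actual code), a self-contained sketch of what "number of probes" means for a power-of-2 table with quadratic (triangular-increment) probing, i.e. the quantity the instrumentation suggested above would log.

// Counts the slots visited before an insert/lookup reaches a free slot.
def probeCount(hash: Int, capacity: Int, occupied: Int => Boolean): Int = {
  val mask = capacity - 1       // capacity is assumed to be a power of 2
  var pos = hash & mask
  var delta = 1
  var probes = 1
  while (occupied(pos)) {       // stop at the first free slot
    pos = (pos + delta) & mask  // triangular sequence: +1, +2, +3, ...
    delta += 1
    probes += 1
  }
  probes
}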
Re: [SparkSQL 1.4.0] groupBy columns are always nullable?
Hi Haopu, actually here `key` is nullable because this is your input's schema : scala result.printSchema root |-- key: string (nullable = true) |-- SUM(value): long (nullable = true) scala df.printSchema root |-- key: string (nullable = true) |-- value: long (nullable = false) I tried it with a schema where the key is not flagged as nullable, and the schema is actually respected. What you can argue however is that SUM(value) should also be not nullable since value is not nullable. @rxin do you think it would be reasonable to flag the Sum aggregation function as nullable (or not) depending on the input expression's schema ? Regards, Olivier. Le lun. 11 mai 2015 à 22:07, Reynold Xin r...@databricks.com a écrit : Not by design. Would you be interested in submitting a pull request? On Mon, May 11, 2015 at 1:48 AM, Haopu Wang hw...@qilinsoft.com wrote: I try to get the result schema of aggregate functions using DataFrame API. However, I find the result field of groupBy columns are always nullable even the source field is not nullable. I want to know if this is by design, thank you! Below is the simple code to show the issue. == import sqlContext.implicits._ import org.apache.spark.sql.functions._ case class Test(key: String, value: Long) val df = sc.makeRDD(Seq(Test(k1,2),Test(k1,1))).toDF val result = df.groupBy(key).agg($key, sum(value)) // From the output, you can see the key column is nullable, why?? result.printSchema //root // |-- key: string (nullable = true) // |-- SUM(value): long (nullable = true) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
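[Editor's note] For reference, a minimal sketch of specifying the schema explicitly so the grouping key stays non-nullable, as described above; the names mirror the example in the thread.

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("key", StringType, nullable = false),
  StructField("value", LongType, nullable = false)))
val rows = sc.parallelize(Seq(Row("k1", 2L), Row("k1", 1L)))
val df2 = sqlContext.createDataFrame(rows, schema)
df2.groupBy("key").agg(sum("value")).printSchema()
// key keeps nullable = false; SUM(value) is still reported as nullable = true,
// which is the behaviour being questioned above.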
Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge
com.yammer.metrics.core.Gauge is in metrics-core jar e.g., in master branch: [INFO] | \- org.apache.kafka:kafka_2.10:jar:0.8.1.1:compile [INFO] | +- com.yammer.metrics:metrics-core:jar:2.2.0:compile Please make sure metrics-core jar is on the classpath. On Mon, May 11, 2015 at 1:32 PM, Lee McFadden splee...@gmail.com wrote: Hi, We've been having some issues getting spark streaming running correctly using a Kafka stream, and we've been going around in circles trying to resolve this dependency. Details of our environment and the error below, if anyone can help resolve this it would be much appreciated. Submit command line: /home/ubuntu/spark/spark-1.3.1/bin/spark-submit \ --packages TargetHolding/pyspark-cassandra:0.1.4,org.apache.spark:spark-streaming-kafka_2.10:1.3.1 \ --conf spark.cassandra.connection.host=10.10.103.172,10.10.102.160,10.10.101.79 \ --master spark://127.0.0.1:7077 \ affected_hosts.py When we run the streaming job everything starts just fine, then we see the following in the logs: 15/05/11 19:50:46 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 70, ip-10-10-102-53.us-west-2.compute.internal): java.lang.NoClassDefFoundError: com/yammer/metrics/core/Gauge at kafka.consumer.ZookeeperConsumerConnector.createFetcher(ZookeeperConsumerConnector.scala:151) at kafka.consumer.ZookeeperConsumerConnector.init(ZookeeperConsumerConnector.scala:115) at kafka.consumer.ZookeeperConsumerConnector.init(ZookeeperConsumerConnector.scala:128) at kafka.consumer.Consumer$.create(ConsumerConnector.scala:89) at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100) at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121) at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:298) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:290) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassNotFoundException: com.yammer.metrics.core.Gauge at java.net.URLClassLoader$1.run(URLClassLoader.java:372) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:360) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 17 more
Spark and RabbitMQ
Are there any existing or under-development modules for streaming messages out of RabbitMQ with Spark Streaming, or perhaps a RabbitMQ RDD? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-RabbitMQ-tp22852.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
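[Editor's note] Nothing ships in Spark itself for RabbitMQ; one common route is a custom receiver. A minimal sketch, assuming the RabbitMQ Java client (amqp-client) is on the classpath; the host and queue names are purely illustrative, and production code would add reconnect/restart handling.

import com.rabbitmq.client.{ConnectionFactory, QueueingConsumer}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class RabbitMQReceiver(host: String, queue: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // Consume on a background thread so onStart() returns promptly.
    new Thread("RabbitMQ Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  override def onStop(): Unit = { /* connection is closed when receive() exits */ }

  private def receive(): Unit = {
    val factory = new ConnectionFactory()
    factory.setHost(host)
    val connection = factory.newConnection()
    val channel = connection.createChannel()
    channel.queueDeclare(queue, false, false, false, null)
    val consumer = new QueueingConsumer(channel)
    channel.basicConsume(queue, true, consumer)
    try {
      while (!isStopped()) {
        val delivery = consumer.nextDelivery()
        store(new String(delivery.getBody, "UTF-8")) // hand each message to Spark
      }
    } finally {
      connection.close()
    }
  }
}

// Usage: val messages = ssc.receiverStream(new RabbitMQReceiver("localhost", "events"))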
Re: Running Spark in local mode seems to ignore local[N]
You have one worker with one executor with 32 execution slots. On Mon, May 11, 2015 at 9:52 PM, dgoldenberg dgoldenberg...@gmail.com wrote: Hi, Is there anything special one must do, running locally and submitting a job like so: spark-submit \ --class com.myco.Driver \ --master local[*] \ ./lib/myco.jar In my logs, I'm only seeing log messages with the thread identifier of Executor task launch worker-0. There are 4 cores on the machine so I expected 4 threads to be at play. Running with local[32] did not yield 32 worker threads. Any recommendations? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-in-local-mode-seems-to-ignore-local-N-tp22851.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge
Thanks Ted, The issue is that I'm using packages (see spark-submit definition) and I do not know how to add com.yammer.metrics:metrics-core to my classpath so Spark can see it. Should metrics-core not be part of the org.apache.spark:spark-streaming-kafka_2.10:1.3.1 package so it can work correctly? If not, any clues as to how I can add metrics-core to my project (bearing in mind that I'm using Python, not a JVM language) would be much appreciated. Thanks, and apologies for my newbness with Java/Scala. On Mon, May 11, 2015 at 1:42 PM Ted Yu yuzhih...@gmail.com wrote: com.yammer.metrics.core.Gauge is in metrics-core jar e.g., in master branch: [INFO] | \- org.apache.kafka:kafka_2.10:jar:0.8.1.1:compile [INFO] | +- com.yammer.metrics:metrics-core:jar:2.2.0:compile Please make sure metrics-core jar is on the classpath. On Mon, May 11, 2015 at 1:32 PM, Lee McFadden splee...@gmail.com wrote: Hi, We've been having some issues getting spark streaming running correctly using a Kafka stream, and we've been going around in circles trying to resolve this dependency. Details of our environment and the error below, if anyone can help resolve this it would be much appreciated. Submit command line: /home/ubuntu/spark/spark-1.3.1/bin/spark-submit \ --packages TargetHolding/pyspark-cassandra:0.1.4,org.apache.spark:spark-streaming-kafka_2.10:1.3.1 \ --conf spark.cassandra.connection.host=10.10.103.172,10.10.102.160,10.10.101.79 \ --master spark://127.0.0.1:7077 \ affected_hosts.py When we run the streaming job everything starts just fine, then we see the following in the logs: 15/05/11 19:50:46 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 70, ip-10-10-102-53.us-west-2.compute.internal): java.lang.NoClassDefFoundError: com/yammer/metrics/core/Gauge at kafka.consumer.ZookeeperConsumerConnector.createFetcher(ZookeeperConsumerConnector.scala:151) at kafka.consumer.ZookeeperConsumerConnector.init(ZookeeperConsumerConnector.scala:115) at kafka.consumer.ZookeeperConsumerConnector.init(ZookeeperConsumerConnector.scala:128) at kafka.consumer.Consumer$.create(ConsumerConnector.scala:89) at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100) at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121) at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:298) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:290) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassNotFoundException: com.yammer.metrics.core.Gauge at java.net.URLClassLoader$1.run(URLClassLoader.java:372) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:360) at 
java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 17 more
Get a list of temporary RDD tables via Thrift
Hi, How can I get a list of temporary tables via Thrift? Have used thrift's startWithContext and registered a temp table, but not seeing the temp table/rdd when running show tables. Thanks, Judy
Re: [SparkSQL 1.4.0] groupBy columns are always nullable?
Thanks for catching this. I didn't read carefully enough. It'd make sense to have the udaf result be non-nullable, if the exprs are indeed non-nullable. On Mon, May 11, 2015 at 1:32 PM, Olivier Girardot ssab...@gmail.com wrote: Hi Haopu, actually here `key` is nullable because this is your input's schema : scala result.printSchema root |-- key: string (nullable = true) |-- SUM(value): long (nullable = true) scala df.printSchema root |-- key: string (nullable = true) |-- value: long (nullable = false) I tried it with a schema where the key is not flagged as nullable, and the schema is actually respected. What you can argue however is that SUM(value) should also be not nullable since value is not nullable. @rxin do you think it would be reasonable to flag the Sum aggregation function as nullable (or not) depending on the input expression's schema ? Regards, Olivier. Le lun. 11 mai 2015 à 22:07, Reynold Xin r...@databricks.com a écrit : Not by design. Would you be interested in submitting a pull request? On Mon, May 11, 2015 at 1:48 AM, Haopu Wang hw...@qilinsoft.com wrote: I try to get the result schema of aggregate functions using DataFrame API. However, I find the result field of groupBy columns are always nullable even the source field is not nullable. I want to know if this is by design, thank you! Below is the simple code to show the issue. == import sqlContext.implicits._ import org.apache.spark.sql.functions._ case class Test(key: String, value: Long) val df = sc.makeRDD(Seq(Test(k1,2),Test(k1,1))).toDF val result = df.groupBy(key).agg($key, sum(value)) // From the output, you can see the key column is nullable, why?? result.printSchema //root // |-- key: string (nullable = true) // |-- SUM(value): long (nullable = true) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: [SparkSQL 1.4.0] groupBy columns are always nullable?
Not by design. Would you be interested in submitting a pull request? On Mon, May 11, 2015 at 1:48 AM, Haopu Wang hw...@qilinsoft.com wrote: I try to get the result schema of aggregate functions using DataFrame API. However, I find the result field of groupBy columns are always nullable even the source field is not nullable. I want to know if this is by design, thank you! Below is the simple code to show the issue. == import sqlContext.implicits._ import org.apache.spark.sql.functions._ case class Test(key: String, value: Long) val df = sc.makeRDD(Seq(Test(k1,2),Test(k1,1))).toDF val result = df.groupBy(key).agg($key, sum(value)) // From the output, you can see the key column is nullable, why?? result.printSchema //root // |-- key: string (nullable = true) // |-- SUM(value): long (nullable = true) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Specify Python interpreter
Hey there, I have installed a Python interpreter in a certain location, say /opt/local/anaconda. Is there any way to specify the Python interpreter while developing in an IPython notebook? Maybe a property set while creating the SparkContext? I know that I can put #!/opt/local/anaconda at the top of my Python code and use spark-submit to distribute it to the cluster. However, since I am using an IPython notebook, this is not available as an option. Best, Bin
Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge
Hi, We've been having some issues getting spark streaming running correctly using a Kafka stream, and we've been going around in circles trying to resolve this dependency. Details of our environment and the error below, if anyone can help resolve this it would be much appreciated. Submit command line: /home/ubuntu/spark/spark-1.3.1/bin/spark-submit \ --packages TargetHolding/pyspark-cassandra:0.1.4,org.apache.spark:spark-streaming-kafka_2.10:1.3.1 \ --conf spark.cassandra.connection.host=10.10.103.172,10.10.102.160,10.10.101.79 \ --master spark://127.0.0.1:7077 \ affected_hosts.py When we run the streaming job everything starts just fine, then we see the following in the logs: 15/05/11 19:50:46 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 70, ip-10-10-102-53.us-west-2.compute.internal): java.lang.NoClassDefFoundError: com/yammer/metrics/core/Gauge at kafka.consumer.ZookeeperConsumerConnector.createFetcher(ZookeeperConsumerConnector.scala:151) at kafka.consumer.ZookeeperConsumerConnector.init(ZookeeperConsumerConnector.scala:115) at kafka.consumer.ZookeeperConsumerConnector.init(ZookeeperConsumerConnector.scala:128) at kafka.consumer.Consumer$.create(ConsumerConnector.scala:89) at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100) at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121) at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:298) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:290) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassNotFoundException: com.yammer.metrics.core.Gauge at java.net.URLClassLoader$1.run(URLClassLoader.java:372) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:360) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 17 more
Re: Running Spark in local mode seems to ignore local[N]
Are you actually running anything that requires all those slots? e.g., locally, I get this with local[16], but only after I run something that actually uses those 16 slots: Executor task launch worker-15 daemon prio=10 tid=0x7f4c80029800 nid=0x8ce waiting on condition [0x7f4c62493000] Executor task launch worker-14 daemon prio=10 tid=0x7f4c80027800 nid=0x8cd waiting on condition [0x7f4c62594000] Executor task launch worker-13 daemon prio=10 tid=0x7f4c80025800 nid=0x8cc waiting on condition [0x7f4c62695000] Executor task launch worker-12 daemon prio=10 tid=0x7f4c80023800 nid=0x8cb waiting on condition [0x7f4c62796000] Executor task launch worker-11 daemon prio=10 tid=0x7f4c80021800 nid=0x8ca waiting on condition [0x7f4c62897000] Executor task launch worker-10 daemon prio=10 tid=0x7f4c8001f800 nid=0x8c9 waiting on condition [0x7f4c62998000] Executor task launch worker-9 daemon prio=10 tid=0x7f4c8001d800 nid=0x8c8 waiting on condition [0x7f4c62a99000] Executor task launch worker-8 daemon prio=10 tid=0x7f4c8001b800 nid=0x8c7 waiting on condition [0x7f4c62b9a000] Executor task launch worker-7 daemon prio=10 tid=0x7f4c80019800 nid=0x8c6 waiting on condition [0x7f4c62c9b000] Executor task launch worker-6 daemon prio=10 tid=0x7f4c80018000 nid=0x8c5 waiting on condition [0x7f4c62d9c000] Executor task launch worker-5 daemon prio=10 tid=0x7f4c80011000 nid=0x8c4 waiting on condition [0x7f4c62e9d000] Executor task launch worker-4 daemon prio=10 tid=0x7f4c8000f800 nid=0x8c3 waiting on condition [0x7f4c62f9e000] Executor task launch worker-3 daemon prio=10 tid=0x7f4c8000e000 nid=0x8c2 waiting on condition [0x7f4c6309f000] Executor task launch worker-2 daemon prio=10 tid=0x7f4c8000c800 nid=0x8c1 waiting on condition [0x7f4c631a] Executor task launch worker-1 daemon prio=10 tid=0x7f4c80007800 nid=0x8c0 waiting on condition [0x7f4c632a1000] Executor task launch worker-0 daemon prio=10 tid=0x7f4c80015800 nid=0x8bf waiting on condition [0x7f4c635f4000] On Mon, May 11, 2015 at 1:52 PM, dgoldenberg dgoldenberg...@gmail.com wrote: Hi, Is there anything special one must do, running locally and submitting a job like so: spark-submit \ --class com.myco.Driver \ --master local[*] \ ./lib/myco.jar In my logs, I'm only seeing log messages with the thread identifier of Executor task launch worker-0. There are 4 cores on the machine so I expected 4 threads to be at play. Running with local[32] did not yield 32 worker threads. Any recommendations? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-in-local-mode-seems-to-ignore-local-N-tp22851.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Marcelo
Re: Running Spark in local mode seems to ignore local[N]
Understood. We'll use the multi-threaded code we already have.. How are these execution slots filled up? I assume each slot is dedicated to one submitted task. If that's the case, how is each task distributed then, i.e. how is that task run in a multi-node fashion? Say 1000 batches/RDD's are extracted out of Kafka, how does that relate to the number of executors vs. task slots? Presumably we can fill up the slots with multiple instances of the same task... How do we know how many to launch? On Mon, May 11, 2015 at 5:20 PM, Sean Owen so...@cloudera.com wrote: BTW I think my comment was wrong as marcelo demonstrated. In standalone mode you'd have one worker, and you do have one executor, but his explanation is right. But, you certainly have execution slots for each core. Are you talking about your own user code? you can make threads, but that's nothing do with Spark then. If you run code on your driver, it's not distributed. If you run Spark over an RDD with 1 partition, only one task works on it. On Mon, May 11, 2015 at 10:16 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Sean, How does this model actually work? Let's say we want to run one job as N threads executing one particular task, e.g. streaming data out of Kafka into a search engine. How do we configure our Spark job execution? Right now, I'm seeing this job running as a single thread. And it's quite a bit slower than just running a simple utility with a thread executor with a thread pool of N threads doing the same task. The performance I'm seeing of running the Kafka-Spark Streaming job is 7 times slower than that of the utility. What's pulling Spark back? Thanks. On Mon, May 11, 2015 at 4:55 PM, Sean Owen so...@cloudera.com wrote: You have one worker with one executor with 32 execution slots. On Mon, May 11, 2015 at 9:52 PM, dgoldenberg dgoldenberg...@gmail.com wrote: Hi, Is there anything special one must do, running locally and submitting a job like so: spark-submit \ --class com.myco.Driver \ --master local[*] \ ./lib/myco.jar In my logs, I'm only seeing log messages with the thread identifier of Executor task launch worker-0. There are 4 cores on the machine so I expected 4 threads to be at play. Running with local[32] did not yield 32 worker threads. Any recommendations? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-in-local-mode-seems-to-ignore-local-N-tp22851.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge
Ah yes, the Kafka + streaming code isn't in the assembly, is it? you'd have to provide it and all its dependencies with your app. You could also build this into your own app jar. Tools like Maven will add in the transitive dependencies. On Mon, May 11, 2015 at 10:04 PM, Lee McFadden splee...@gmail.com wrote: Thanks Ted, The issue is that I'm using packages (see spark-submit definition) and I do not know how to add com.yammer.metrics:metrics-core to my classpath so Spark can see it. Should metrics-core not be part of the org.apache.spark:spark-streaming-kafka_2.10:1.3.1 package so it can work correctly? If not, any clues as to how I can add metrics-core to my project (bearing in mind that I'm using Python, not a JVM language) would be much appreciated. Thanks, and apologies for my newbness with Java/Scala. On Mon, May 11, 2015 at 1:42 PM Ted Yu yuzhih...@gmail.com wrote: com.yammer.metrics.core.Gauge is in metrics-core jar e.g., in master branch: [INFO] | \- org.apache.kafka:kafka_2.10:jar:0.8.1.1:compile [INFO] | +- com.yammer.metrics:metrics-core:jar:2.2.0:compile Please make sure metrics-core jar is on the classpath. On Mon, May 11, 2015 at 1:32 PM, Lee McFadden splee...@gmail.com wrote: Hi, We've been having some issues getting spark streaming running correctly using a Kafka stream, and we've been going around in circles trying to resolve this dependency. Details of our environment and the error below, if anyone can help resolve this it would be much appreciated. Submit command line: /home/ubuntu/spark/spark-1.3.1/bin/spark-submit \ --packages TargetHolding/pyspark-cassandra:0.1.4,org.apache.spark:spark-streaming-kafka_2.10:1.3.1 \ --conf spark.cassandra.connection.host=10.10.103.172,10.10.102.160,10.10.101.79 \ --master spark://127.0.0.1:7077 \ affected_hosts.py When we run the streaming job everything starts just fine, then we see the following in the logs: 15/05/11 19:50:46 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 70, ip-10-10-102-53.us-west-2.compute.internal): java.lang.NoClassDefFoundError: com/yammer/metrics/core/Gauge at kafka.consumer.ZookeeperConsumerConnector.createFetcher(ZookeeperConsumerConnector.scala:151) at kafka.consumer.ZookeeperConsumerConnector.init(ZookeeperConsumerConnector.scala:115) at kafka.consumer.ZookeeperConsumerConnector.init(ZookeeperConsumerConnector.scala:128) at kafka.consumer.Consumer$.create(ConsumerConnector.scala:89) at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100) at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121) at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:298) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:290) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassNotFoundException: 
com.yammer.metrics.core.Gauge at java.net.URLClassLoader$1.run(URLClassLoader.java:372) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:360) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 17 more - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Met a problem when using spark to load parquet files with different version schemas
Creating dataframes and union them looks reasonable. thanks, Wei On Mon, May 11, 2015 at 6:39 PM, Michael Armbrust mich...@databricks.com wrote: Ah, yeah sorry. I should have read closer and realized that what you are asking for is not supported. It might be possible to add simple coercions such as this one, but today, compatible schemas must only add/remove columns and cannot change types. You could try creating different dataframes and unionAll them. Coercions should be inserted automatically in that case. On Mon, May 11, 2015 at 3:37 PM, Wei Yan ywsk...@gmail.com wrote: Thanks for the reply, Michael. The problem is, if I set spark.sql.parquet.useDataSourceApi to true, spark cannot create a DataFrame. The exception shows it failed to merge incompatible schemas. I think here it means that, the int schema cannot be merged with the long one. Does it mean that the schema merging doesn't support the same field with different types? -Wei On Mon, May 11, 2015 at 3:10 PM, Michael Armbrust mich...@databricks.com wrote: BTW, I use spark 1.3.1, and already set spark.sql.parquet.useDataSourceApi to false. Schema merging is only supported when this flag is set to true (setting it to false uses old code that will be removed once the new code is proven).
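[Editor's note] For reference, a minimal sketch of the unionAll route, with hypothetical paths and columns where the old files stored `id` as int and the newer ones as long.

import org.apache.spark.sql.types.LongType

val oldDf = sqlContext.parquetFile("/data/events_v1")
val newDf = sqlContext.parquetFile("/data/events_v2")
// Casting the old column explicitly keeps the two schemas aligned even if the
// automatic coercion on unionAll does not cover this case.
val oldAligned = oldDf.select(oldDf("id").cast(LongType).as("id"), oldDf("payload"))
val all = oldAligned.unionAll(newDf.select("id", "payload"))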
Re: Spark SQL: STDDEV working in Spark Shell but not in a standalone app
I doubt that will make it as we are pretty slammed with other things and the author needs to address the comments / merge conflict still. I'll add that in general I recommend users use the HiveContext, even if they aren't using Hive at all. Its a strict super set of the functionality provided by SQLContext and the parser is much more powerful. On Mon, May 11, 2015 at 4:00 PM, Oleg Shirokikh o...@solver.com wrote: Michael – Thanks for the response – that’s right, I haven’t noticed that Spark Shell instantiates sqlContext as a HiveContext, not actual Spark SQL Context… I’ve seen the PR to add STDDEV to data frames.. Can I expect this to be added to Spark SQL in Spark 1.4 or it’s still uncertain? It would be really helpful to know in order to understand if I have to change existing code to use HiveContext instead of SQLContext (which would be undesired)… Thanks! *From:* Michael Armbrust [mailto:mich...@databricks.com] *Sent:* Saturday, May 09, 2015 11:32 AM *To:* Oleg Shirokikh *Cc:* user *Subject:* Re: Spark SQL: STDDEV working in Spark Shell but not in a standalone app Are you perhaps using a HiveContext in the shell but a SQLContext in your app? I don't think we natively implement stddev until 1.4.0 On Fri, May 8, 2015 at 4:44 PM, barmaley o...@solver.com wrote: Given a registered table from data frame, I'm able to execute queries like sqlContext.sql(SELECT STDDEV(col1) FROM table) from Spark Shell just fine. However, when I run exactly the same code in a standalone app on a cluster, it throws an exception: java.util.NoSuchElementException: key not found: STDDEV... Is STDDEV ia among default functions in Spark SQL? I'd appreciate if you could comment what's going on with the above. Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-STDDEV-working-in-Spark-Shell-but-not-in-a-standalone-app-tp22825.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
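[Editor's note] For reference, a minimal sketch of the HiveContext route in a standalone app, which makes Hive's STDDEV UDAF available; the path, table, and column names are hypothetical.

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)
val df = sqlContext.parquetFile("/data/some_table")
df.registerTempTable("table1")
sqlContext.sql("SELECT STDDEV(col1) FROM table1").show()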
Re: Running Spark in local mode seems to ignore local[N]
Seems to be running OK with 4 threads, 16 threads... While running with 32 threads I started getting the below. 15/05/11 19:48:46 WARN executor.Executor: Issue communicating with driver in heartbeater org.apache.spark.SparkException: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@7668b255,BlockManagerId(driver, localhost, 43318))] at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:209) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427) Caused by: akka.pattern.AskTimeoutException: Recipient[Actor[akka://sparkDriver/user/HeartbeatReceiver#-677986522]] had already been terminated. at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:132) at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:194) ... 1 more On Mon, May 11, 2015 at 5:34 PM, Sean Owen so...@cloudera.com wrote: You might have a look at the Spark docs to start. 1 batch = 1 RDD, but 1 RDD can have many partitions. And should, for scale. You do not submit multiple jobs to get parallelism. The number of partitions in a streaming RDD is determined by the block interval and the batch interval. If you have a batch interval of 10s and block interval of 1s you'll get 10 partitions of data in the RDD. On Mon, May 11, 2015 at 10:29 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Understood. We'll use the multi-threaded code we already have.. How are these execution slots filled up? I assume each slot is dedicated to one submitted task. If that's the case, how is each task distributed then, i.e. how is that task run in a multi-node fashion? Say 1000 batches/RDD's are extracted out of Kafka, how does that relate to the number of executors vs. task slots? Presumably we can fill up the slots with multiple instances of the same task... How do we know how many to launch? On Mon, May 11, 2015 at 5:20 PM, Sean Owen so...@cloudera.com wrote: BTW I think my comment was wrong as marcelo demonstrated. In standalone mode you'd have one worker, and you do have one executor, but his explanation is right. But, you certainly have execution slots for each core. Are you talking about your own user code? you can make threads, but that's nothing do with Spark then. If you run code on your driver, it's not distributed. If you run Spark over an RDD with 1 partition, only one task works on it. On Mon, May 11, 2015 at 10:16 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Sean, How does this model actually work? Let's say we want to run one job as N threads executing one particular task, e.g. streaming data out of Kafka into a search engine. How do we configure our Spark job execution? Right now, I'm seeing this job running as a single thread. And it's quite a bit slower than just running a simple utility with a thread executor with a thread pool of N threads doing the same task. The performance I'm seeing of running the Kafka-Spark Streaming job is 7 times slower than that of the utility. What's pulling Spark back? Thanks. On Mon, May 11, 2015 at 4:55 PM, Sean Owen so...@cloudera.com wrote: You have one worker with one executor with 32 execution slots. On Mon, May 11, 2015 at 9:52 PM, dgoldenberg dgoldenberg...@gmail.com wrote: Hi, Is there anything special one must do, running locally and submitting a job like so: spark-submit \ --class com.myco.Driver \ --master local[*] \ ./lib/myco.jar In my logs, I'm only seeing log messages with the thread identifier of Executor task launch worker-0. There are 4 cores on the machine so I expected 4 threads to be at play. 
Running with local[32] did not yield 32 worker threads. Any recommendations? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-in-local-mode-seems-to-ignore-local-N-tp22851.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
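To make Sean's arithmetic concrete, here is a minimal Scala streaming sketch (the stream source, host, and port are illustrative): with a 10-second batch interval and a 1-second block interval each receiver yields roughly 10 blocks, hence roughly 10 partitions per batch RDD, and an explicit repartition can spread that work across more of the local[N] threads.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// 10s batches with 1s blocks => ~10 partitions per batch per receiver
val conf = new SparkConf()
  .setAppName("local-parallelism-sketch")
  .setMaster("local[4]")                        // worker threads in local mode
  .set("spark.streaming.blockInterval", "1000") // block interval in milliseconds
val ssc = new StreamingContext(conf, Seconds(10))

val lines = ssc.socketTextStream("localhost", 9999) // any receiver-based source
val spread = lines.repartition(8)                   // widen parallelism if blocks are too coarse
spread.count().print()

ssc.start()
ssc.awaitTermination()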
Re: spark : use the global config variables in executors
Note that `object` is equivalent to a class full of static fields / methods (in Java), so the data it holds will not be serialized, ever. What you want is a config class instead, so you can instantiate it, and that instance can be serialized. Then you can easily do (1) or (3).

On Mon, May 11, 2015 at 7:55 AM, hotdog lisend...@163.com wrote: I have a global config object in my Spark app. object Config { var lambda = 0.01 } and I set the value of lambda according to the user's input. object MyApp { def main(args: Array[String]) { Config.lambda = args(0).toDouble ... rdd.map(_ * Config.lambda) } } I found that the modification does not take effect in the executors. The value of lambda is always 0.01. I guess a modification in the driver's JVM will not affect the executors'. Do you have another solution? I found a similar question on Stack Overflow: http://stackoverflow.com/questions/29685330/how-to-set-and-get-static-variables-from-spark In @DanielL.'s answer, he gives three solutions:
1. Put the value inside a closure to be serialized to the executors to perform a task. **But I wonder how to write the closure and how to serialize it to the executors, could anyone give me a code example?**
2. If the values are fixed or the configuration is available on the executor nodes (lives inside the jar, etc.), then you can have a lazy val, guaranteeing initialization only once. **What if I declare lambda as a lazy val? Will a modification in the driver take effect in the executors? Could you give me a code example?**
3. Create a broadcast variable with the data. I know this way, but it also needs a local Broadcast[] variable which wraps the Config object, right? For example: val config = sc.broadcast(Config) and use `config.value.lambda` in the executors, right? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-use-the-global-config-variables-in-executors-tp22846.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Marcelo
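A minimal sketch of what Marcelo describes, combined with options (1) and (3) from the Stack Overflow answer; the class, app name and values are illustrative. Replace the global object with a serializable instance, then either capture it in the closure or broadcast it once.

import org.apache.spark.{SparkConf, SparkContext}

case class Config(lambda: Double)   // a plain, serializable value instead of a global object

object MyApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("config-sketch"))
    val config = Config(args(0).toDouble)

    val rdd = sc.parallelize(1 to 10).map(_.toDouble)

    // (1) capture the instance in the closure; it is serialized with each task
    val scaled = rdd.map(_ * config.lambda)

    // (3) or broadcast it once and read it on the executors
    val configBc = sc.broadcast(config)
    val scaled2 = rdd.map(_ * configBc.value.lambda)

    println(s"${scaled.sum()} ${scaled2.sum()}")
    sc.stop()
  }
}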
Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge
Ted, many thanks. I'm not used to Java dependencies so this was a real head-scratcher for me. Downloading the two metrics packages from the maven repository (metrics-core, metrics-annotation) and supplying it on the spark-submit command line worked. My final spark-submit for a python project using Kafka as an input source: /home/ubuntu/spark/spark-1.3.1/bin/spark-submit \ --packages TargetHolding/pyspark-cassandra:0.1.4,org.apache.spark:spark-streaming-kafka_2.10:1.3.1 \ --jars /home/ubuntu/jars/metrics-core-2.2.0.jar,/home/ubuntu/jars/metrics-annotation-2.2.0.jar \ --conf spark.cassandra.connection.host=10.10.103.172,10.10.102.160,10.10.101.79 \ --master spark://127.0.0.1:7077 \ affected_hosts.py Now we're seeing data from the stream. Thanks again! On Mon, May 11, 2015 at 2:43 PM Sean Owen so...@cloudera.com wrote: Ah yes, the Kafka + streaming code isn't in the assembly, is it? you'd have to provide it and all its dependencies with your app. You could also build this into your own app jar. Tools like Maven will add in the transitive dependencies. On Mon, May 11, 2015 at 10:04 PM, Lee McFadden splee...@gmail.com wrote: Thanks Ted, The issue is that I'm using packages (see spark-submit definition) and I do not know how to add com.yammer.metrics:metrics-core to my classpath so Spark can see it. Should metrics-core not be part of the org.apache.spark:spark-streaming-kafka_2.10:1.3.1 package so it can work correctly? If not, any clues as to how I can add metrics-core to my project (bearing in mind that I'm using Python, not a JVM language) would be much appreciated. Thanks, and apologies for my newbness with Java/Scala. On Mon, May 11, 2015 at 1:42 PM Ted Yu yuzhih...@gmail.com wrote: com.yammer.metrics.core.Gauge is in metrics-core jar e.g., in master branch: [INFO] | \- org.apache.kafka:kafka_2.10:jar:0.8.1.1:compile [INFO] | +- com.yammer.metrics:metrics-core:jar:2.2.0:compile Please make sure metrics-core jar is on the classpath. On Mon, May 11, 2015 at 1:32 PM, Lee McFadden splee...@gmail.com wrote: Hi, We've been having some issues getting spark streaming running correctly using a Kafka stream, and we've been going around in circles trying to resolve this dependency. Details of our environment and the error below, if anyone can help resolve this it would be much appreciated. 
Submit command line: /home/ubuntu/spark/spark-1.3.1/bin/spark-submit \ --packages TargetHolding/pyspark-cassandra:0.1.4,org.apache.spark:spark-streaming-kafka_2.10:1.3.1 \ --conf spark.cassandra.connection.host=10.10.103.172,10.10.102.160,10.10.101.79 \ --master spark://127.0.0.1:7077 \ affected_hosts.py When we run the streaming job everything starts just fine, then we see the following in the logs: 15/05/11 19:50:46 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 70, ip-10-10-102-53.us-west-2.compute.internal): java.lang.NoClassDefFoundError: com/yammer/metrics/core/Gauge at kafka.consumer.ZookeeperConsumerConnector.createFetcher(ZookeeperConsumerConnector.scala:151) at kafka.consumer.ZookeeperConsumerConnector.init(ZookeeperConsumerConnector.scala:115) at kafka.consumer.ZookeeperConsumerConnector.init(ZookeeperConsumerConnector.scala:128) at kafka.consumer.Consumer$.create(ConsumerConnector.scala:89) at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100) at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121) at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:298) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:290) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassNotFoundException: com.yammer.metrics.core.Gauge at java.net.URLClassLoader$1.run(URLClassLoader.java:372) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.security.AccessController.doPrivileged(Native Method)
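On Sean's point about building the dependency into an application jar: for a JVM application, the missing metrics classes could be declared directly in the build so that an assembly jar carries them, instead of passing extra jars to spark-submit. A hedged sbt sketch follows, with coordinates and versions taken from the thread (they may differ in other setups); for the PySpark case above, the --jars workaround remains the practical fix.

// build.sbt -- bundle the Kafka integration plus its transitive metrics dependency
name := "kafka-streaming-app"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-streaming-kafka" % "1.3.1",
  "com.yammer.metrics" %  "metrics-core"          % "2.2.0"
)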
Can standalone cluster manager provide I/O information on worker nodes?
Hi, Can the standalone cluster manager provide I/O information on worker nodes? If not, could someone point out the proper file to modify to add that functionality? Also, does Mesos support that? Regards. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
DStream Union vs. StreamingContext Union
Can someone explain to me the difference between DStream union and StreamingContext union? When do you use one vs the other? Thanks, Vadim
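For reference, a short sketch of the two calls (stream names, host and ports are illustrative): DStream.union merges exactly two streams, so it has to be chained, while StreamingContext.union takes a whole collection of streams in one call, which is convenient when the number of input streams is dynamic.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setAppName("union-sketch").setMaster("local[4]"), Seconds(5))

val s1 = ssc.socketTextStream("localhost", 9991)
val s2 = ssc.socketTextStream("localhost", 9992)
val s3 = ssc.socketTextStream("localhost", 9993)

// DStream.union: pairwise, so chaining is needed for more than two streams
val pairwise = s1.union(s2).union(s3)

// StreamingContext.union: takes a whole collection in one call
val all = ssc.union(Seq(s1, s2, s3))

all.count().print()
ssc.start()
ssc.awaitTermination()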
RE: Spark SQL: STDDEV working in Spark Shell but not in a standalone app
Michael – Thanks for the response – that's right, I hadn't noticed that the Spark Shell instantiates sqlContext as a HiveContext, not an actual Spark SQLContext… I've seen the PR to add STDDEV to data frames. Can I expect this to be added to Spark SQL in Spark 1.4, or is it still uncertain? It would be really helpful to know in order to understand whether I have to change existing code to use HiveContext instead of SQLContext (which would be undesired)… Thanks!

From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Saturday, May 09, 2015 11:32 AM To: Oleg Shirokikh Cc: user Subject: Re: Spark SQL: STDDEV working in Spark Shell but not in a standalone app Are you perhaps using a HiveContext in the shell but a SQLContext in your app? I don't think we natively implement stddev until 1.4.0

On Fri, May 8, 2015 at 4:44 PM, barmaley o...@solver.com wrote: Given a registered table from a data frame, I'm able to execute queries like sqlContext.sql(SELECT STDDEV(col1) FROM table) from the Spark Shell just fine. However, when I run exactly the same code in a standalone app on a cluster, it throws an exception: java.util.NoSuchElementException: key not found: STDDEV... Is STDDEV among the default functions in Spark SQL? I'd appreciate it if you could comment on what's going on with the above. Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-STDDEV-working-in-Spark-Shell-but-not-in-a-standalone-app-tp22825.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
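A minimal sketch of the HiveContext switch Michael describes, assuming a standalone app (app, table and column names are illustrative); with a HiveContext the Hive stddev UDAF resolves even before native support lands in 1.4.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("stddev-sketch"))
val sqlContext = new HiveContext(sc)   // instead of new SQLContext(sc)
import sqlContext.implicits._

// Build a tiny DataFrame and register it so SQL can see it.
val df = sc.parallelize(Seq(1.0, 2.0, 3.0)).map(Tuple1(_)).toDF("col1")
df.registerTempTable("measurements")

sqlContext.sql("SELECT STDDEV(col1) FROM measurements").show()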
Re: Reading Nested Fields in DataFrames
Had the same question on Stack Overflow recently: http://stackoverflow.com/questions/30008127/how-to-read-a-nested-collection-in-spark Lomig Mégard had a detailed answer on how to do this without using LATERAL VIEW.

On Mon, May 11, 2015 at 8:05 AM, Ashish Kumar Singh ashish23...@gmail.com wrote: Hi, I am trying to read nested Avro data in Spark 1.3 using DataFrames. I need help retrieving the inner element data in the structure below. Below is the schema when I enter df.printSchema:
 |-- THROTTLING_PERCENTAGE: double (nullable = false)
 |-- IMPRESSION_TYPE: string (nullable = false)
 |-- campaignArray: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- COOKIE: string (nullable = false)
 |    |    |-- CAMPAIGN_ID: long (nullable = false)
How can I access the CAMPAIGN_ID field in this schema? Thanks, Ashish Kr. Singh
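For comparison, the classic LATERAL VIEW route (the Stack Overflow answer above shows how to avoid it) is a short sketch like the following, assuming sqlContext is a HiveContext and using an illustrative table name.

// df is the DataFrame with the schema shown above
df.registerTempTable("events")

val campaignIds = sqlContext.sql(
  """SELECT c.CAMPAIGN_ID, c.COOKIE
    |FROM events
    |LATERAL VIEW explode(campaignArray) t AS c""".stripMargin)

campaignIds.show()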
How to get Master UI with ZooKeeper HA setup?
Hi, We have a 3-node master setup with ZooKeeper HA. Driver can find the master with spark://xxx:xxx,xxx:xxx,xxx:xxx But how can I find out the valid Master UI without looping through all 3 nodes? Thanks
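One workaround, sketched below, is to probe each master's web UI and keep the one that reports itself as ALIVE; note that the /json endpoint and the exact shape of its status field are assumptions about the standalone Master UI, and the hosts and ports are placeholders.

import scala.io.Source
import scala.util.Try

val masterUiUrls = Seq("http://master1:8080", "http://master2:8080", "http://master3:8080")

// Keep the first UI whose JSON status looks ALIVE rather than STANDBY.
val activeUi: Option[String] = masterUiUrls.find { url =>
  Try(Source.fromURL(s"$url/json").mkString).toOption
    .exists(json => json.contains("ALIVE") && !json.contains("STANDBY"))
}

println(activeUi.getOrElse("no ALIVE master found"))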
Re: Met a problem when using spark to load parquet files with different version schemas
Ah, yeah sorry. I should have read closer and realized that what you are asking for is not supported. It might be possible to add simple coercions such as this one, but today, compatible schemas must only add/remove columns and cannot change types. You could try creating different dataframes and unionAll them. Coercions should be inserted automatically in that case. On Mon, May 11, 2015 at 3:37 PM, Wei Yan ywsk...@gmail.com wrote: Thanks for the reply, Michael. The problem is, if I set spark.sql.parquet.useDataSourceApi to true, spark cannot create a DataFrame. The exception shows it failed to merge incompatible schemas. I think here it means that, the int schema cannot be merged with the long one. Does it mean that the schema merging doesn't support the same field with different types? -Wei On Mon, May 11, 2015 at 3:10 PM, Michael Armbrust mich...@databricks.com wrote: BTW, I use spark 1.3.1, and already set spark.sql.parquet.useDataSourceApi to false. Schema merging is only supported when this flag is set to true (setting it to false uses old code that will be removed once the new code is proven).
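A hedged sketch of the unionAll workaround (paths and column names are illustrative): read each schema version as its own DataFrame and union them; per Michael the coercion should be inserted automatically, but an explicit cast of the divergent column makes the intent clear.

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.LongType

val sqlContext = new SQLContext(sc)

val oldData = sqlContext.parquetFile("/data/events_v1")   // e.g. id stored as int
val newData = sqlContext.parquetFile("/data/events_v2")   // e.g. id stored as long

// Widen the old column explicitly so both sides have identical schemas.
val oldWidened = oldData.select(oldData("id").cast(LongType).as("id"), oldData("payload"))

val all = oldWidened.unionAll(newData.select("id", "payload"))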
Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge
I opened a ticket on this (without posting here first - bad etiquette, apologies) which was closed as 'fixed'. https://issues.apache.org/jira/browse/SPARK-7538 I don't believe that because I have my script running means this is fixed, I think it is still an issue. I downloaded the spark source, ran `mvn -DskipTests clean package `, then simply launched my python script (which shouldn't be introducing additional *java* dependencies itself?). Doesn't this mean these dependencies are missing from the spark build, since I didn't modify any files within the distribution and my application itself can't be introducing java dependency clashes? On Mon, May 11, 2015, 4:34 PM Lee McFadden splee...@gmail.com wrote: Ted, many thanks. I'm not used to Java dependencies so this was a real head-scratcher for me. Downloading the two metrics packages from the maven repository (metrics-core, metrics-annotation) and supplying it on the spark-submit command line worked. My final spark-submit for a python project using Kafka as an input source: /home/ubuntu/spark/spark-1.3.1/bin/spark-submit \ --packages TargetHolding/pyspark-cassandra:0.1.4,org.apache.spark:spark-streaming-kafka_2.10:1.3.1 \ --jars /home/ubuntu/jars/metrics-core-2.2.0.jar,/home/ubuntu/jars/metrics-annotation-2.2.0.jar \ --conf spark.cassandra.connection.host=10.10.103.172,10.10.102.160,10.10.101.79 \ --master spark://127.0.0.1:7077 \ affected_hosts.py Now we're seeing data from the stream. Thanks again! On Mon, May 11, 2015 at 2:43 PM Sean Owen so...@cloudera.com wrote: Ah yes, the Kafka + streaming code isn't in the assembly, is it? you'd have to provide it and all its dependencies with your app. You could also build this into your own app jar. Tools like Maven will add in the transitive dependencies. On Mon, May 11, 2015 at 10:04 PM, Lee McFadden splee...@gmail.com wrote: Thanks Ted, The issue is that I'm using packages (see spark-submit definition) and I do not know how to add com.yammer.metrics:metrics-core to my classpath so Spark can see it. Should metrics-core not be part of the org.apache.spark:spark-streaming-kafka_2.10:1.3.1 package so it can work correctly? If not, any clues as to how I can add metrics-core to my project (bearing in mind that I'm using Python, not a JVM language) would be much appreciated. Thanks, and apologies for my newbness with Java/Scala. On Mon, May 11, 2015 at 1:42 PM Ted Yu yuzhih...@gmail.com wrote: com.yammer.metrics.core.Gauge is in metrics-core jar e.g., in master branch: [INFO] | \- org.apache.kafka:kafka_2.10:jar:0.8.1.1:compile [INFO] | +- com.yammer.metrics:metrics-core:jar:2.2.0:compile Please make sure metrics-core jar is on the classpath. On Mon, May 11, 2015 at 1:32 PM, Lee McFadden splee...@gmail.com wrote: Hi, We've been having some issues getting spark streaming running correctly using a Kafka stream, and we've been going around in circles trying to resolve this dependency. Details of our environment and the error below, if anyone can help resolve this it would be much appreciated. 
Submit command line: /home/ubuntu/spark/spark-1.3.1/bin/spark-submit \ --packages TargetHolding/pyspark-cassandra:0.1.4,org.apache.spark:spark-streaming-kafka_2.10:1.3.1 \ --conf spark.cassandra.connection.host=10.10.103.172,10.10.102.160,10.10.101.79 \ --master spark://127.0.0.1:7077 \ affected_hosts.py When we run the streaming job everything starts just fine, then we see the following in the logs: 15/05/11 19:50:46 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 70, ip-10-10-102-53.us-west-2.compute.internal): java.lang.NoClassDefFoundError: com/yammer/metrics/core/Gauge at kafka.consumer.ZookeeperConsumerConnector.createFetcher(ZookeeperConsumerConnector.scala:151) at kafka.consumer.ZookeeperConsumerConnector.init(ZookeeperConsumerConnector.scala:115) at kafka.consumer.ZookeeperConsumerConnector.init(ZookeeperConsumerConnector.scala:128) at kafka.consumer.Consumer$.create(ConsumerConnector.scala:89) at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100) at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121) at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:298) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:290) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) at
Re: Spark can not access jar from HDFS !!
After upgrading to spark 1.3, these statements on hivecontext are working fine. Thanks On Mon, May 11, 2015, 12:15 Ravindra ravindra.baj...@gmail.com wrote: Hi All, Thanks for suggestions. What I tried is - hiveContext.sql (add jar ) and that helps to complete the create temporary function but while using this function I get ClassNotFound for the class handling this function. The same class is present in the jar added . Please note that the same works fine from the Hive Shell. Is there an issue with Spark while distributing jars across workers? May be that is causing the problem. Also can you please suggest the manual way of copying the jars to the workers, I just want to ascertain my assumption. Thanks, Ravi On Sun, May 10, 2015 at 1:40 AM Michael Armbrust mich...@databricks.com wrote: That code path is entirely delegated to hive. Does hive support this? You might try instead using sparkContext.addJar. On Sat, May 9, 2015 at 12:32 PM, Ravindra ravindra.baj...@gmail.com wrote: Hi All, I am trying to create custom udfs with hiveContext as given below - scala hiveContext.sql (CREATE TEMPORARY FUNCTION sample_to_upper AS 'com.abc.api.udf.MyUpper' USING JAR 'hdfs:///users/ravindra/customUDF2.jar') I have put the udf jar in the hdfs at the path given above. The same command works well in the hive shell but failing here in the spark shell. And it fails as given below. - 15/05/10 00:41:51 ERROR Task: FAILED: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load JAR hdfs:///users/ravindra/customUDF2.jar 15/05/10 00:41:51 INFO FunctionTask: create function: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load JAR hdfs:///users/ravindra/customUDF2.jar at org.apache.hadoop.hive.ql.exec.FunctionTask.addFunctionResources(FunctionTask.java:305) at org.apache.hadoop.hive.ql.exec.FunctionTask.createTemporaryFunction(FunctionTask.java:179) at org.apache.hadoop.hive.ql.exec.FunctionTask.execute(FunctionTask.java:81) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:305) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276) at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35) at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35) at org.apache.spark.sql.execution.Command$class.execute(commands.scala:46) at org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:108) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94) at $line17.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:18) at $line17.$read$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:23) at $line17.$read$$iwC$$iwC$$iwC$$iwC.init(console:25) at $line17.$read$$iwC$$iwC$$iwC.init(console:27) at $line17.$read$$iwC$$iwC.init(console:29) at 
$line17.$read$$iwC.init(console:31) at $line17.$read.init(console:33) at $line17.$read$.init(console:37) at $line17.$read$.clinit(console) at $line17.$eval$.init(console:7) at $line17.$eval$.clinit(console) at $line17.$eval.$print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:636) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:641) at
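For readers stuck on earlier builds, a hedged sketch of Michael's sparkContext.addJar suggestion (the jar path and class come from the thread; the table name is illustrative). Whether the driver-side Hive classloader also picks the class up this way is not guaranteed, and the thread's actual fix was upgrading to Spark 1.3.

// Ship the UDF jar to the executors, then register the function without USING JAR.
sc.addJar("hdfs:///users/ravindra/customUDF2.jar")

hiveContext.sql("CREATE TEMPORARY FUNCTION sample_to_upper AS 'com.abc.api.udf.MyUpper'")

hiveContext.sql("SELECT sample_to_upper(name) FROM some_table").collect()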
RE: Does long-lived SparkContext hold on to executor resources?
Also check out the spark.cleaner.ttl property. Otherwise, you will accumulate shuffle metadata in the memory of the driver.

-Original Message- From: Silvio Fiorito [silvio.fior...@granturing.com] Sent: Monday, May 11, 2015 01:03 PM Eastern Standard Time To: stanley; user@spark.apache.org Subject: Re: Does long-lived SparkContext hold on to executor resources? You want to look at dynamic resource allocation, here: http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation

On 5/11/15, 11:23 AM, stanley wangshua...@yahoo.com wrote: I am building an analytics app with Spark. I plan to use long-lived SparkContexts to minimize the overhead for creating Spark contexts, which in turn reduces the analytics query response time. The number of queries that are run in the system is relatively small each day. Would long-lived contexts hold on to the executor resources when there are no queries running? Is there a way to free executor resources in this type of use case? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-long-lived-SparkContext-hold-on-to-executor-resources-tp22848.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
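Pulling both suggestions together, a hedged configuration sketch (values are illustrative; dynamic allocation also needs the external shuffle service on the workers, and at the time of this thread it is only supported on YARN). The same properties can equally go in spark-defaults.conf.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "0")
  .set("spark.dynamicAllocation.maxExecutors", "20")
  .set("spark.shuffle.service.enabled", "true")   // external shuffle service required
  .set("spark.cleaner.ttl", "3600")               // seconds; periodic metadata cleanup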
Unread block data
Hi, I'm trying to compile and use Spark 1.3.1 with Hadoop 2.2.0. I compiled from source with the Maven command: mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -Phive -Phive-0.12.0 -Phive-thriftserver -DskipTests clean package When I run this with a local master (bin/spark-shell --master local[2]) I can read data from the HDFS. When I run it on a four-node cluster, with the same jars and setup on each node, I get these errors: scala val textInput = sc.textFile(hdfs:///testing/dict) textInput: org.apache.spark.rdd.RDD[String] = hdfs:///testing/dict MapPartitionsRDD[7] at textFile at console:24 scala textInput take 10 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 17, hadoop-kn-t503.systems.private): java.lang.IllegalStateException: unread block data at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) Can anyone tell me what the issue is?
Re: Does long-lived SparkContext hold on to executor resources?
You want to look at dynamic resource allocation, here: http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation

On 5/11/15, 11:23 AM, stanley wangshua...@yahoo.com wrote: I am building an analytics app with Spark. I plan to use long-lived SparkContexts to minimize the overhead for creating Spark contexts, which in turn reduces the analytics query response time. The number of queries that are run in the system is relatively small each day. Would long-lived contexts hold on to the executor resources when there are no queries running? Is there a way to free executor resources in this type of use case? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-long-lived-SparkContext-hold-on-to-executor-resources-tp22848.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
TwitterPopularTags Long Processing Delay
Hi, I'm running TwitterPopularTags.scala on a single node. Everything works fine for about 30 minutes, but after that I see a long processing delay for tasks, and it keeps increasing. Has anyone experienced the same issue? Here are my configurations: spark.driver.memory 5g spark.executor.memory 5g Thanks, Seyed -- Seyed Majid Zahedi https://www.cs.duke.edu/~zahedi/ - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org