Re: SparkSQL issue: Spark 1.3.1 + hadoop 2.6 on CDH5.3 with parquet

2016-06-20 Thread Satya
Hello,
We are also experiencing the same error. Could you please share the steps
that resolved the issue?
Thanks
Satya



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-issue-Spark-1-3-1-hadoop-2-6-on-CDH5-3-with-parquet-tp22808p27197.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: FAILED_TO_UNCOMPRESS Error - Spark 1.3.1

2016-05-30 Thread Takeshi Yamamuro
Hi,

This is a known issue.
You need to check a related JIRA ticket:
https://issues.apache.org/jira/browse/SPARK-4105
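
If upgrading is not an option, a workaround that is sometimes suggested for this
kind of failure is to move shuffle/broadcast compression off Snappy while you
debug the job. A minimal sketch, assuming a plain SparkConf-based setup
(switching the codec is an assumption on my part, not a confirmed fix for SPARK-4105):

import org.apache.spark.{SparkConf, SparkContext}

// Only the codec-related settings matter here; everything else is placeholder.
val conf = new SparkConf()
  .setAppName("failed-to-uncompress-workaround")   // hypothetical app name
  .set("spark.io.compression.codec", "lz4")        // try lz4 (or "lzf") instead of snappy
  .set("spark.broadcast.compress", "true")         // keep compression on, only the codec changes
val sc = new SparkContext(conf)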

// maropu

On Mon, May 30, 2016 at 7:51 PM, Prashant Singh Thakur <
prashant.tha...@impetus.co.in> wrote:

> Hi,
>
>
>
> We are trying to use Spark Data Frames for our use case where we are
> getting this exception.
>
> The parameters used are listed below. Kindly suggest if we are missing
> something.
>
> Version used is Spark 1.3.1
>
> Jira is still showing this issue as Open
> https://issues.apache.org/jira/browse/SPARK-4105
>
> Kindly suggest if there is a workaround.
>
>
>
> Exception :
>
> Caused by: org.apache.spark.SparkException: Job aborted due to stage
> failure: Task 88 in stage 40.0 failed 4 times, most recent failure: Lost
> task 88.3 in stage 40.0 : java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>
>   at
> org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>
>   at org.xerial.snappy.SnappyNative.rawUncompress(Native
> Method)
>
>   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
>
>   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
>
>   at
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
>
>   at
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>
>   at
> org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58)
>
>   at
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:160)
>
>   at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$7.apply(TorrentBroadcast.scala:213)
>
>   at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$7.apply(TorrentBroadcast.scala:213)
>
>   at scala.Option.map(Option.scala:145)
>
>   at
> org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:213)
>
>   at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:177)
>
>   at
> org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1153)
>
>   at
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
>
>   at
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
>
>   at
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
>
>   at
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
>
>   at
> org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>
>   at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:61)
>
>   at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>
>   at org.apache.spark.scheduler.Task.run(Task.scala:64)
>
>   at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>
>   at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>   at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
>   at java.lang.Thread.run(Thread.java:745)
>
>
>
> Parameters Changed :
>
> spark.akka.frameSize=50
>
> spark.shuffle.memoryFraction=0.4
>
> spark.storage.memoryFraction=0.5
>
> spark.worker.timeout=12
>
> spark.storage.blockManagerSlaveTimeoutMs=12
>
> spark.akka.heartbeat.pauses=6000
>
> spark.akka.heartbeat.interval=1000
>
> spark.ui.port=21000
>
> spark.port.maxRetries=50
>
> spark.executor.memory=10G
>
> spark.executor.instances=100
>
> spark.driver.memory=8G
>
> spark.executor.cores=2
>
> spark.shuffle.compress=true
>
> spark.io.compression.codec=snappy
>
> spark.broadcast.compress=true
>
> spark.rdd.compress=true
>
> spark.worker.cleanup.enabled=true
>
> spark.worker.cleanup.interval=600
>
> spark.worker.cleanup.appDataTtl=600
>
> spark.shuffle.consolidateFiles=true
>
> spark.yarn.preserve.staging.files=false
>
> spark.yarn.driver.memoryOverhead=1024
>
> spark.yarn.executor.memoryOverhead=1024
>
>
>
> Best Regards,
>
> Prashant Singh Thakur
>
> Mobile: +91-9740266522
>
>
>
> --
>
>
>
>
>
>
>



-- 
---
Takeshi Yamamuro


FAILED_TO_UNCOMPRESS Error - Spark 1.3.1

2016-05-30 Thread Prashant Singh Thakur
Hi,

We are trying to use Spark Data Frames for our use case where we are getting 
this exception.
The parameters used are listed below. Kindly suggest if we are missing 
something.
Version used is Spark 1.3.1
Jira is still showing this issue as Open 
https://issues.apache.org/jira/browse/SPARK-4105
Kindly suggest if there is a workaround.

Exception :
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 88 in stage 40.0 failed 4 times, most recent failure: Lost task 88.3 in 
stage 40.0 : java.io.IOException: FAILED_TO_UNCOMPRESS(5)
  at 
org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
  at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
  at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
  at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
  at 
org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
  at 
org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
  at 
org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58)
  at 
org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:160)
  at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$7.apply(TorrentBroadcast.scala:213)
  at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$7.apply(TorrentBroadcast.scala:213)
  at scala.Option.map(Option.scala:145)
  at 
org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:213)
  at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:177)
  at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1153)
  at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
  at 
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
  at 
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
  at 
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
  at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
  at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:61)
  at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
  at org.apache.spark.scheduler.Task.run(Task.scala:64)
  at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)

Parameters Changed :
spark.akka.frameSize=50
spark.shuffle.memoryFraction=0.4
spark.storage.memoryFraction=0.5
spark.worker.timeout=12
spark.storage.blockManagerSlaveTimeoutMs=12
spark.akka.heartbeat.pauses=6000
spark.akka.heartbeat.interval=1000
spark.ui.port=21000
spark.port.maxRetries=50
spark.executor.memory=10G
spark.executor.instances=100
spark.driver.memory=8G
spark.executor.cores=2
spark.shuffle.compress=true
spark.io.compression.codec=snappy
spark.broadcast.compress=true
spark.rdd.compress=true
spark.worker.cleanup.enabled=true
spark.worker.cleanup.interval=600
spark.worker.cleanup.appDataTtl=600
spark.shuffle.consolidateFiles=true
spark.yarn.preserve.staging.files=false
spark.yarn.driver.memoryOverhead=1024
spark.yarn.executor.memoryOverhead=1024
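
(For reference only: settings like the list above can also be supplied
programmatically through SparkConf instead of configuration files or --conf
flags. A minimal sketch with a hypothetical application name, showing a few of
the values listed:)

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("dataframe-job")                  // hypothetical name
  .set("spark.executor.memory", "10G")
  .set("spark.executor.cores", "2")
  .set("spark.io.compression.codec", "snappy")
  .set("spark.rdd.compress", "true")
val sc = new SparkContext(conf)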

Best Regards,
Prashant Singh Thakur
Mobile: +91-9740266522











Re: Migrating Transformers from Spark 1.3.1 to 1.5.0

2016-02-15 Thread Cesar Flores
I found my problem: I was calling setParameterValue(defaultValue) more than
once in the hierarchy of my classes.
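
For anyone searching the archives: the shape that avoids this, as far as I can
tell, is to declare each Param once and set its default exactly once. A rough
sketch (class and parameter names are illustrative, and the plain ml.Transformer
base class here stands in for the custom FeatureProviderTransformer):

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// Illustrative transformer: one Param, declared once, default set once.
class DemoTransformer(override val uid: String) extends Transformer {

  def this() = this(Identifiable.randomUID("demo-transformer"))

  final val anotherParam: Param[String] =
    new Param[String](this, "anotherParam", "dummy parameter")
  setDefault(anotherParam -> "default-value")   // call this only once per param

  def setAnotherParam(value: String): this.type = set(anotherParam, value)

  override def transform(df: DataFrame): DataFrame = df   // no-op for the sketch
  override def transformSchema(schema: StructType): StructType = schema
  override def copy(extra: ParamMap): DemoTransformer = defaultCopy(extra)
}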




Thanks!

On Mon, Feb 15, 2016 at 6:34 PM, Cesar Flores  wrote:

>
> I have a set of transformers (each with specific parameters) in spark
> 1.3.1. I have two versions, one that works and one that does not:
>
> 1.- working version
> //featureprovidertransformer contains already a set of ml params
> class DemographicTransformer(override val uid: String) extends
> FeatureProviderTransformer {
>
>   def this() = this(Identifiable.randomUID("demo-transformer"))
>   override def copy(extra: ParamMap): DemographicTransformer =
> defaultCopy(extra)
>
>   
>
> }
>
> 2.- not working version
> class DemographicTransformer(override val uid: String) extends
> FeatureProviderTransformer {
>
>   def this() = this(Identifiable.randomUID("demo-transformer"))
>   override def copy(extra: ParamMap): DemographicTransformer =
> defaultCopy(extra)
>
>   *//add another transformer parameter*
>   final val anotherParam: Param[String] = new Param[String](this,
> "anotherParam", "dummy parameter")
>   
>
> }
>
> Somehow, adding *anotherParam* to my class makes it fail, with the
> following error:
>
> [info]   java.lang.NullPointerException:
> [info]   at
> org.apache.spark.ml.param.Params$$anonfun$hasParam$1.apply(params.scala:408)
> [info]   at
> org.apache.spark.ml.param.Params$$anonfun$hasParam$1.apply(params.scala:408)
> [info]   at
> scala.collection.IndexedSeqOptimized$$anonfun$exists$1.apply(IndexedSeqOptimized.scala:40)
> [info]   at
> scala.collection.IndexedSeqOptimized$$anonfun$exists$1.apply(IndexedSeqOptimized.scala:40)
> [info]   at
> scala.collection.IndexedSeqOptimized$class.segmentLength(IndexedSeqOptimized.scala:189)
> [info]   at
> scala.collection.mutable.ArrayOps$ofRef.segmentLength(ArrayOps.scala:108)
> [info]   at
> scala.collection.GenSeqLike$class.prefixLength(GenSeqLike.scala:92)
> [info]   at
> scala.collection.mutable.ArrayOps$ofRef.prefixLength(ArrayOps.scala:108)
> [info]   at
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:40)
> [info]   at
> scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:108)
>
> Debugging the params.scala class shows me that adding
> *anotherParam* *replaces all parameters with a single one called allParams.*
>
> *Does anyone have any idea of what I may be doing wrong? My guess is that
> I am doing something weird in my class hierarchy but cannot figure out
> what.*
>
>
> Thanks!
> --
> Cesar Flores
>



-- 
Cesar Flores


Migrating Transformers from Spark 1.3.1 to 1.5.0

2016-02-15 Thread Cesar Flores
I have a set of transformers (each with specific parameters) in spark
1.3.1. I have two versions, one that works and one that does not:

1.- working version
//featureprovidertransformer contains already a set of ml params
class DemographicTransformer(override val uid: String) extends
FeatureProviderTransformer {

  def this() = this(Identifiable.randomUID("demo-transformer"))
  override def copy(extra: ParamMap): DemographicTransformer =
defaultCopy(extra)

  

}

2.- not working version
class DemographicTransformer(override val uid: String) extends
FeatureProviderTransformer {

  def this() = this(Identifiable.randomUID("demo-transformer"))
  override def copy(extra: ParamMap): DemographicTransformer =
defaultCopy(extra)

  *//add another transformer parameter*
  final val anotherParam: Param[String] = new Param[String](this,
"anotherParam", "dummy parameter")
  

}

Somehow, adding *anotherParam* to my class makes it fail, with the
following error:

[info]   java.lang.NullPointerException:
[info]   at
org.apache.spark.ml.param.Params$$anonfun$hasParam$1.apply(params.scala:408)
[info]   at
org.apache.spark.ml.param.Params$$anonfun$hasParam$1.apply(params.scala:408)
[info]   at
scala.collection.IndexedSeqOptimized$$anonfun$exists$1.apply(IndexedSeqOptimized.scala:40)
[info]   at
scala.collection.IndexedSeqOptimized$$anonfun$exists$1.apply(IndexedSeqOptimized.scala:40)
[info]   at
scala.collection.IndexedSeqOptimized$class.segmentLength(IndexedSeqOptimized.scala:189)
[info]   at
scala.collection.mutable.ArrayOps$ofRef.segmentLength(ArrayOps.scala:108)
[info]   at
scala.collection.GenSeqLike$class.prefixLength(GenSeqLike.scala:92)
[info]   at
scala.collection.mutable.ArrayOps$ofRef.prefixLength(ArrayOps.scala:108)
[info]   at
scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:40)
[info]   at
scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:108)

Debugging the params.scala class shows me that adding
*anotherParam* *replaces all parameters with a single one called allParams.*

*Does anyone have any idea of what I may be doing wrong? My guess is that I
am doing something weird in my class hierarchy but cannot figure out what.*


Thanks!
-- 
Cesar Flores


Re: Spark 1.3.1 - Does SparkConext in multi-threaded env requires SparkEnv.set(env) anymore

2015-12-10 Thread Josh Rosen
Nope, you shouldn't have to do that anymore. As of
https://github.com/apache/spark/pull/2624, which is in Spark 1.2.0+,
SparkEnv's thread-local stuff was removed and replaced by a simple global
variable, since it was used in an *effectively* global way before (see my
comments on that PR). As a result, there shouldn't be any need for
you to call SparkEnv.set(env) in your user threads anymore.
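
In other words, something like the following toy sketch (not taken from the PR)
should work on recent versions without ever touching SparkEnv:

import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object MultiThreadedJobs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("multi-threaded-jobs"))

    // Submit several independent jobs from different threads against one SparkContext.
    // No SparkEnv.set(...) call is needed in these threads.
    val jobs = (1 to 4).map { i =>
      Future { sc.parallelize(1 to 1000000).map(_ * i).sum() }
    }

    jobs.foreach(f => println(Await.result(f, 10.minutes)))
    sc.stop()
  }
}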

On Thu, Dec 10, 2015 at 11:03 AM, Nirav Patel  wrote:

> As the subject says, do we still need to set the static env in every thread that
> accesses the SparkContext? I read some references here.
>
>
> http://qnalist.com/questions/4956211/is-spark-context-in-local-mode-thread-safe
>
>
>
> 


Spark 1.3.1 - Does SparkConext in multi-threaded env requires SparkEnv.set(env) anymore

2015-12-10 Thread Nirav Patel
As the subject says, do we still need to set the static env in every thread that
accesses the SparkContext? I read some references here.

http://qnalist.com/questions/4956211/is-spark-context-in-local-mode-thread-safe

-- 





Re: How to handle the UUID in Spark 1.3.1

2015-10-09 Thread Ted Yu
I guess that should work :-)
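
For the archives, a sketch of the cast-before-save idea (column names are taken
from the schema quoted below; this is untested against the Cassandra connector
and assumes the df from that message):

// Cast the uuid columns to strings before writing, since Parquet in Spark 1.3.1
// has no UUID support. Only a subset of the columns is shown.
val writable = df.selectExpr(
  "account_id",
  "cast(campaign_id as string) as campaign_id",
  "cast(contacts_list_id as string) as contacts_list_id",
  "cast(parent_id as string) as parent_id",
  "name",
  "start_date")

writable.saveAsParquetFile("hdfs://location/table_name.parquet")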

On Fri, Oct 9, 2015 at 10:46 AM, java8964  wrote:

> Thanks, Ted.
>
> Does this mean I am out of luck for now? If I use HiveContext, and cast
> the UUID as string, will it work?
>
> Yong
>
> --
> Date: Fri, 9 Oct 2015 09:09:38 -0700
> Subject: Re: How to handle the UUID in Spark 1.3.1
> From: yuzhih...@gmail.com
> To: java8...@hotmail.com
> CC: user@spark.apache.org
>
>
> This is related:
> SPARK-10501
>
> On Fri, Oct 9, 2015 at 7:28 AM, java8964  wrote:
>
> Hi,  Sparkers:
>
> In this case, I want to use Spark as an ETL engine to load the data from
> Cassandra, and save it into HDFS.
>
> Here is the environment specified information:
>
> Spark 1.3.1
> Cassandra 2.1
> HDFS/Hadoop 2.2
>
> I am using the Cassandra Spark Connector 1.3.x, with which I have no problem
> querying the C* data in Spark. But I have a problem trying to save the
> data into HDFS, like below:
>
> val df = sqlContext.load("org.apache.spark.sql.cassandra", options = Map(
> "c_table" -> "table_name", "keyspace" -> "keyspace_name")
> df: org.apache.spark.sql.DataFrame = [account_id: bigint, campaign_id:
> uuid, business_info_ids: array, closed_date: timestamp,
> compliance_hold: boolean, contacts_list_id: uuid, contacts_list_seq:
> bigint, currency_type: string, deleted_date: timestamp, discount_info:
> map, end_date: timestamp, insert_by: string, insert_time:
> timestamp, last_update_by: string, last_update_time: timestamp, name:
> string, parent_id: uuid, publish_date: timestamp, share_incentive:
> map, start_date: timestamp, version: int]
>
> scala> df.count
> res12: Long = 757704
>
> I can also dump the data output using df.first, without any problem.
>
> But when I try to save it:
>
> scala> df.save("hdfs://location", "parquet")
> java.lang.RuntimeException: Unsupported datatype UUIDType
> at scala.sys.package$.error(package.scala:27)
> at
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:372)
> at
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:316)
> at scala.Option.getOrElse(Option.scala:120)
> at
> org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:315)
> at
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:395)
> at
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:394)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at
> org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:393)
> at
> org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:440)
> at
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.prepareMetadata(newParquet.scala:260)
> at
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276)
> at
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:269)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:269)
> at
> org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:391)
> at
> org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:98)
> at
> org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:128)
> at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:240)
> at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1196)
> at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1156)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:37)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$

RE: How to handle the UUID in Spark 1.3.1

2015-10-09 Thread java8964
Thanks, Ted.
Does this mean I am out of luck for now? If I use HiveContext, and cast the 
UUID as string, will it work?
Yong

Date: Fri, 9 Oct 2015 09:09:38 -0700
Subject: Re: How to handle the UUID in Spark 1.3.1
From: yuzhih...@gmail.com
To: java8...@hotmail.com
CC: user@spark.apache.org

This is related:SPARK-10501

On Fri, Oct 9, 2015 at 7:28 AM, java8964  wrote:



Hi, Sparkers:

In this case, I want to use Spark as an ETL engine to load the data from 
Cassandra and save it into HDFS.

Here is the environment specified information:

Spark 1.3.1
Cassandra 2.1
HDFS/Hadoop 2.2

I am using the Cassandra Spark Connector 1.3.x, with which I have no problem 
querying the C* data in Spark. But I have a problem trying to save the data 
into HDFS, like below:

val df = sqlContext.load("org.apache.spark.sql.cassandra", options = Map(
  "c_table" -> "table_name", "keyspace" -> "keyspace_name"))
df: org.apache.spark.sql.DataFrame = [account_id: bigint, campaign_id: uuid,
business_info_ids: array, closed_date: timestamp, compliance_hold: boolean,
contacts_list_id: uuid, contacts_list_seq: bigint, currency_type: string,
deleted_date: timestamp, discount_info: map, end_date: timestamp,
insert_by: string, insert_time: timestamp, last_update_by: string,
last_update_time: timestamp, name: string, parent_id: uuid, publish_date:
timestamp, share_incentive: map, start_date: timestamp, version: int]

scala> df.count
res12: Long = 757704

I can also dump the data output using df.first, without any problem.
But when I try to save it:
scala> df.save("hdfs://location", "parquet")java.lang.RuntimeException: 
Unsupported datatype UUIDType   at scala.sys.package$.error(package.scala:27)   
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:372)
 at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:316)
 at scala.Option.getOrElse(Option.scala:120) at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:315)
 at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:395)
  at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:394)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.immutable.List.foreach(List.scala:318)  at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)at 
scala.collection.AbstractTraversable.map(Traversable.scala:105)  at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:393)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:440)
at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.prepareMetadata(newParquet.scala:260)
at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276)
   at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:269)
   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.immutable.List.foreach(List.scala:318)  at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)at 
scala.collection.AbstractTraversable.map(Traversable.scala:105)  at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:269)
at 
org.apache.spark.sql.parquet.ParquetRelation2.(newParquet.scala:391)   at 
org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:98)  
 at 
org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:128) 
 at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:240)   
 at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1196)at 
org.apache.spark.sql.DataFrame.save(DataFrame.scala:1156)at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
 at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)  at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)   at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:37)at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:39) at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:41)  at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)   at 
$iwC$$iwC$$iwC$$iwC$$iwC.(:45)at 
$iwC$$iwC$$iwC$$iwC.(:47) at 
$iwC$$iwC$$iwC.(:49)  at $iwC$$iwC.(:51)   at 
$iwC.(:53)at (:55) at .(:59)   
 at .() at .(:7) at .() at 
$print()at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMetho

Re: How to handle the UUID in Spark 1.3.1

2015-10-09 Thread Ted Yu
This is related:
SPARK-10501

On Fri, Oct 9, 2015 at 7:28 AM, java8964  wrote:

> Hi,  Sparkers:
>
> In this case, I want to use Spark as an ETL engine to load the data from
> Cassandra, and save it into HDFS.
>
> Here is the environment specified information:
>
> Spark 1.3.1
> Cassandra 2.1
> HDFS/Hadoop 2.2
>
> I am using the Cassandra Spark Connector 1.3.x, with which I have no problem
> querying the C* data in Spark. But I have a problem trying to save the
> data into HDFS, like below:
>
> val df = sqlContext.load("org.apache.spark.sql.cassandra", options = Map(
> "c_table" -> "table_name", "keyspace" -> "keyspace_name")
> df: org.apache.spark.sql.DataFrame = [account_id: bigint, campaign_id:
> uuid, business_info_ids: array, closed_date: timestamp,
> compliance_hold: boolean, contacts_list_id: uuid, contacts_list_seq:
> bigint, currency_type: string, deleted_date: timestamp, discount_info:
> map, end_date: timestamp, insert_by: string, insert_time:
> timestamp, last_update_by: string, last_update_time: timestamp, name:
> string, parent_id: uuid, publish_date: timestamp, share_incentive:
> map, start_date: timestamp, version: int]
>
> scala> df.count
> res12: Long = 757704
>
> I can also dump the data output using df.first, without any problem.
>
> But when I try to save it:
>
> scala> df.save("hdfs://location", "parquet")
> java.lang.RuntimeException: Unsupported datatype UUIDType
> at scala.sys.package$.error(package.scala:27)
> at
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:372)
> at
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:316)
> at scala.Option.getOrElse(Option.scala:120)
> at
> org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:315)
> at
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:395)
> at
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:394)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at
> org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:393)
> at
> org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:440)
> at
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.prepareMetadata(newParquet.scala:260)
> at
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276)
> at
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:269)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:269)
> at
> org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:391)
> at
> org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:98)
> at
> org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:128)
> at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:240)
> at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1196)
> at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1156)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:37)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:39)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:41)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:45)
> at $iwC$$iwC$$iwC$$iwC.(:47)
> at $iwC$$iwC$$iwC.(:49)
> at $iwC$$iwC.(:51)
> at $iwC.(:53)
> at (:55)
> at .(:59)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.jav

How to handle the UUID in Spark 1.3.1

2015-10-09 Thread java8964
Hi, Sparkers:

In this case, I want to use Spark as an ETL engine to load the data from 
Cassandra and save it into HDFS.

Here is the environment specified information:

Spark 1.3.1
Cassandra 2.1
HDFS/Hadoop 2.2

I am using the Cassandra Spark Connector 1.3.x, with which I have no problem 
querying the C* data in Spark. But I have a problem trying to save the data 
into HDFS, like below:

val df = sqlContext.load("org.apache.spark.sql.cassandra", options = Map(
  "c_table" -> "table_name", "keyspace" -> "keyspace_name"))
df: org.apache.spark.sql.DataFrame = [account_id: bigint, campaign_id: uuid,
business_info_ids: array, closed_date: timestamp, compliance_hold: boolean,
contacts_list_id: uuid, contacts_list_seq: bigint, currency_type: string,
deleted_date: timestamp, discount_info: map, end_date: timestamp,
insert_by: string, insert_time: timestamp, last_update_by: string,
last_update_time: timestamp, name: string, parent_id: uuid, publish_date:
timestamp, share_incentive: map, start_date: timestamp, version: int]

scala> df.count
res12: Long = 757704

I can also dump the data output using df.first, without any problem.
But when I try to save it:
scala> df.save("hdfs://location", "parquet")java.lang.RuntimeException: 
Unsupported datatype UUIDType   at scala.sys.package$.error(package.scala:27)   
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:372)
 at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:316)
 at scala.Option.getOrElse(Option.scala:120) at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:315)
 at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:395)
  at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:394)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.immutable.List.foreach(List.scala:318)  at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)at 
scala.collection.AbstractTraversable.map(Traversable.scala:105)  at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:393)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:440)
at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.prepareMetadata(newParquet.scala:260)
at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276)
   at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:269)
   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.immutable.List.foreach(List.scala:318)  at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)at 
scala.collection.AbstractTraversable.map(Traversable.scala:105)  at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:269)
at 
org.apache.spark.sql.parquet.ParquetRelation2.(newParquet.scala:391)   at 
org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:98)  
 at 
org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:128) 
 at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:240)   
 at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1196)at 
org.apache.spark.sql.DataFrame.save(DataFrame.scala:1156)at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
 at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)  at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)   at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:37)at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:39) at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:41)  at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)   at 
$iwC$$iwC$$iwC$$iwC$$iwC.(:45)at 
$iwC$$iwC$$iwC$$iwC.(:47) at 
$iwC$$iwC$$iwC.(:49)  at $iwC$$iwC.(:51)   at 
$iwC.(:53)at (:55) at .(:59)   
 at .() at .(:7) at .() at 
$print()at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)   
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606) at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)   at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)   at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)   at 
org.apache.spark.repl.SparkIMain.interpret(SparkIMain.sca

Re: Spark 1.3.1 on Yarn not using all given capacity

2015-10-06 Thread Cesar Berezowski
3 cores* not 8

César.



> On 6 Oct 2015, at 19:08, Cesar Berezowski wrote:
> 
> I deployed hdp 2.3.1 and got spark 1.3.1, spark 1.4 is supposed to be 
> available as technical preview I think
> 
> vendor’s forum ? you mean hortonworks' ? 
> 
> --
> Update on my info: 
> 
> Set Yarn to use 16 cores instead of 8 & set min container size to 4096 MB
> Thus: 
> 12 executors, 12 GB of RAM and 8 cores
> 
> But same issue: it still creates 3 containers (+ driver), 1 core and 6.3 GB each, 
> taking 16 GB on YARN
> 
> César.
> 
> 
> 
>> On 6 Oct 2015, at 19:00, Ted Yu <yuzhih...@gmail.com> wrote:
>> 
>> Consider posting the question on the vendor's forum.
>> 
>> HDP 2.3 comes with Spark 1.4 if I remember correctly.
>> 
>> On Tue, Oct 6, 2015 at 9:05 AM, czoo <ce...@adaltas.com> wrote:
>> Hi,
>> 
>> This post might be a duplicate with updates from another one (by me), sorry
>> in advance
>> 
>> I have an HDP 2.3 cluster running Spark 1.3.1 on 6 nodes (edge + master + 4
>> workers)
>> Each worker has 8 cores and 40G of RAM available in Yarn
>> 
>> That makes a total of 160GB and 32 cores
>> 
>> I'm running a job with the following parameters :
>> --master yarn-client
>> --num-executors 12 (-> 3 / node)
>> --executor-cores 2
>> --executor-memory 12G
>> 
>> I don't know if it's optimal but it should run (right ?)
>> 
>> However I end up with spark setting up 2 executors using 1 core & 6.2G each
>> 
>> Plus, my job is doing a cartesian product so I end up with a pretty big
>> DataFrame that inevitably ends on a GC exception...
>> It used to run on HDP2.2 / Spark 1.2.1 but I can't find any way to run it
>> now
>> 
>> Any Idea ?
>> 
>> Thanks a lot
>> 
>> Cesar
>> 
>> 
>> 
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-1-on-Yarn-not-using-all-given-capacity-tp24955.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>> 
>> 
> 



Re: Spark 1.3.1 on Yarn not using all given capacity

2015-10-06 Thread Ted Yu
Consider posting the question on the vendor's forum.

HDP 2.3 comes with Spark 1.4 if I remember correctly.

On Tue, Oct 6, 2015 at 9:05 AM, czoo  wrote:

> Hi,
>
> This post might be a duplicate with updates from another one (by me), sorry
> in advance
>
> I have an HDP 2.3 cluster running Spark 1.3.1 on 6 nodes (edge + master + 4
> workers)
> Each worker has 8 cores and 40G of RAM available in Yarn
>
> That makes a total of 160GB and 32 cores
>
> I'm running a job with the following parameters :
> --master yarn-client
> --num-executors 12 (-> 3 / node)
> --executor-cores 2
> --executor-memory 12G
>
> I don't know if it's optimal but it should run (right ?)
>
> However I end up with spark setting up 2 executors using 1 core & 6.2G each
>
> Plus, my job is doing a cartesian product so I end up with a pretty big
> DataFrame that inevitably ends on a GC exception...
> It used to run on HDP2.2 / Spark 1.2.1 but I can't find any way to run it
> now
>
> Any Idea ?
>
> Thanks a lot
>
> Cesar
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-1-on-Yarn-not-using-all-given-capacity-tp24955.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Spark 1.3.1 on Yarn not using all given capacity

2015-10-06 Thread czoo
Hi, 

This post might be a duplicate with updates from another one (by me), sorry
in advance

I have an HDP 2.3 cluster running Spark 1.3.1 on 6 nodes (edge + master + 4
workers) 
Each worker has 8 cores and 40G of RAM available in Yarn 

That makes a total of 160GB and 32 cores

I'm running a job with the following parameters : 
--master yarn-client
--num-executors 12 (-> 3 / node)
--executor-cores 2 
--executor-memory 12G 

I don't know if it's optimal but it should run (right ?)
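
For comparison, the same sizing expressed as standard configuration properties
instead of flags (a sketch only, with a hypothetical application name):

import org.apache.spark.{SparkConf, SparkContext}

// Same request as the spark-submit flags above, expressed as properties.
val conf = new SparkConf()
  .setAppName("capacity-test")                  // hypothetical app name
  .setMaster("yarn-client")
  .set("spark.executor.instances", "12")
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "12g")
val sc = new SparkContext(conf)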

However I end up with spark setting up 2 executors using 1 core & 6.2G each

Plus, my job is doing a cartesian product so I end up with a pretty big
DataFrame that inevitably ends on a GC exception...
It used to run on HDP2.2 / Spark 1.2.1 but I can't find any way to run it
now

Any Idea ?

Thanks a lot

Cesar



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-1-on-Yarn-not-using-all-given-capacity-tp24955.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.3.1 saveAsParquetFile hangs on app exit

2015-08-26 Thread cingram
spark-shell-hang-on-exit.tdump

  





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-1-saveAsParquetFile-hangs-on-app-exit-tp24460p24461.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.3.1 saveAsParquetFile hangs on app exit

2015-08-26 Thread Cheng Lian

Could you please show the jstack result of the hung process? Thanks!

Cheng

On 8/26/15 10:46 PM, cingram wrote:

I have a simple test that is hanging when using s3a with spark 1.3.1. Is
there something I need to do to clean up the S3A file system? The write to S3
appears to have worked but this job hangs in the spark-shell and using
spark-submit. Any help would be greatly appreciated. TIA.

import sqlContext.implicits._
import com.datastax.spark.connector._
case class LU(userid: String, timestamp: Long, lat: Double, lon: Double)
val uid ="testuser"
val lue = sc.cassandraTable[LU]("test", "foo").where("userid=?", uid).toDF
lue.saveAsParquetFile("s3a://twc-scratch/craig_lues")



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-1-saveAsParquetFile-hangs-on-app-exit-tp24460.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark 1.3.1 saveAsParquetFile hangs on app exit

2015-08-26 Thread cingram
I have a simple test that is hanging when using s3a with spark 1.3.1. Is
there something I need to do to clean up the S3A file system? The write to S3
appears to have worked but this job hangs in the spark-shell and using
spark-submit. Any help would be greatly appreciated. TIA.

import sqlContext.implicits._
import com.datastax.spark.connector._
case class LU(userid: String, timestamp: Long, lat: Double, lon: Double)
val uid ="testuser"
val lue = sc.cassandraTable[LU]("test", "foo").where("userid=?", uid).toDF
lue.saveAsParquetFile("s3a://twc-scratch/craig_lues")
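
One thing that may be worth trying (an untested suggestion, not a confirmed
fix): stop the SparkContext explicitly once the write has finished, instead of
relying on the shutdown hook to tear down the S3A/Hadoop file system clients. A
sketch of the same test with an explicit stop:

// Same test as above, with an explicit shutdown at the end (untested suggestion).
import sqlContext.implicits._
import com.datastax.spark.connector._

case class LU(userid: String, timestamp: Long, lat: Double, lon: Double)

val uid = "testuser"
val lue = sc.cassandraTable[LU]("test", "foo").where("userid=?", uid).toDF
lue.saveAsParquetFile("s3a://twc-scratch/craig_lues")

sc.stop()   // tear down the context (and its Hadoop FileSystem clients) explicitly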



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-1-saveAsParquetFile-hangs-on-app-exit-tp24460.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Running spark shell on mesos with zookeeper on spark 1.3.1

2015-08-24 Thread kohlisimranjit
I have set up Apache Mesos using Mesosphere on CentOS 6 with Java 8. I have
3 slaves, which total 3 cores and 8 GB of RAM. I have set no firewalls. I am
trying to run the following lines of code to test whether the setup is
working:

 val data = 1 to 1
 val distData = sc.parallelize(data)
 distData.filter(_ < 10).collect()

I get the following on my console
15/08/24 20:54:57 INFO SparkContext: Starting job: collect at :26
15/08/24 20:54:57 INFO DAGScheduler: Got job 0 (collect at :26)
with 8 output partitions (allowLocal=false)
15/08/24 20:54:57 INFO DAGScheduler: Final stage: Stage 0(collect at
:26)
15/08/24 20:54:57 INFO DAGScheduler: Parents of final stage: List()
15/08/24 20:54:57 INFO DAGScheduler: Missing parents: List()
15/08/24 20:54:57 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[1]
at filter at :26), which has no missing parents
15/08/24 20:54:57 INFO MemoryStore: ensureFreeSpace(1792) called with
curMem=0, maxMem=280248975
15/08/24 20:54:57 INFO MemoryStore: Block broadcast_0 stored as values in
memory (estimated size 1792.0 B, free 267.3 MB)
15/08/24 20:54:57 INFO MemoryStore: ensureFreeSpace(1293) called with
curMem=1792, maxMem=280248975
15/08/24 20:54:57 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes
in memory (estimated size 1293.0 B, free 267.3 MB)
15/08/24 20:54:57 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory
on ip-172-31-46-176.ec2.internal:33361 (size: 1293.0 B, free: 267.3 MB)
15/08/24 20:54:57 INFO BlockManagerMaster: Updated info of block
broadcast_0_piece0
15/08/24 20:54:57 INFO SparkContext: Created broadcast 0 from broadcast at
DAGScheduler.scala:839
15/08/24 20:54:57 INFO DAGScheduler: Submitting 8 missing tasks from Stage 0
(MapPartitionsRDD[1] at filter at :26)
15/08/24 20:54:57 INFO TaskSchedulerImpl: Adding task set 0.0 with 8 tasks
15/08/24 20:55:12 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient resources
15/08/24 20:55:27 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient resources
.
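
The repeated warning above usually means either that no slave has registered
with the framework or that the job is asking for more CPU/memory than any
single offer provides. A sketch of capping the request so it fits inside the
3-core / 8 GB cluster (the ZooKeeper hosts and values below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Keep the resource request small enough to fit the 3-core / 8 GB cluster.
// The zk:// master URL is illustrative; substitute the real ZooKeeper hosts.
val conf = new SparkConf()
  .setAppName("mesos-smoke-test")
  .setMaster("mesos://zk://zk1:2181,zk2:2181,zk3:2181/mesos")
  .set("spark.cores.max", "2")
  .set("spark.executor.memory", "1g")

val sc = new SparkContext(conf)
println(sc.parallelize(1 to 100).filter(_ < 10).collect().mkString(","))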

Here are the logs from /var/log/mesos, attached:

mesos-slave.22202
mesos-master.22181





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-spark-shell-on-mesos-with-zookeeper-on-spark-1-3-1-tp24430.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



ClassCastException when saving a DataFrame to parquet file (saveAsParquetFile, Spark 1.3.1) using Scala

2015-08-21 Thread Emma Boya Peng
Hi,

I was trying to programmatically specify a schema, apply it to an RDD of
Rows, and save the resulting DataFrame as a parquet file, but I got
"java.lang.ClassCastException: java.lang.String cannot be cast to
java.lang.Long" on the last step.

Here's what I did:

1. Created an RDD of Rows from RDD[Array[String]]:
val gameId= Long.valueOf(line(0))
  val accountType = Long.valueOf(line(1))
  val worldId = Long.valueOf(line(2))
  val dtEventTime = line(3)
  val iEventId = line(4)

  return Row(gameId, accountType, worldId, dtEventTime, iEventId)

2. Generate the schema:
 return StructType(Array(StructField("GameId", LongType, true),
StructField("AccountType", LongType, true), StructField("WorldId",
LongType, true), StructField("dtEventTime", StringType, true),
StructField("iEventId",
StringType, true)))

3. Apply the schema to the RDD of Rows:
val schemaRdd = sqlContext.createDataFrame(rowRdd, schema)

4. Save schemaRdd as a parquet file:
 schemaRdd.saveAsParquetFile(dst + "/" + tableName + ".parquet")

However, it gave me a ClassCastException on step 4 (the DataFrame, i.e.
schemaRdd, can be correctly printed out according to the specified schema).

Thank you for your help!

Best,
Emma

Stack trace of the exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3
in stage 1.0 failed 4 times, most recent failure: Lost task 3.3 in stage
1.0 (TID 12, 10-4-28-24): java.lang.ClassCastException: java.lang.String
cannot be cast to java.lang.Long
at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:110)
at
org.apache.spark.sql.catalyst.expressions.GenericRow.getLong(rows.scala:88)
at
org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:357)
at
org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:338)
at
org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:324)
at
parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
at org.apache.spark.sql.parquet.ParquetRelation2.org
$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:671)
at
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
at
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
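
In general this particular ClassCastException shows up when the runtime type of
a Row field does not match the DataType declared at the same position in the
schema, and it only surfaces once Parquet actually writes the values. A
self-contained sketch of the aligned pattern (field names and paths are
illustrative only):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object SchemaAlignedParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("schema-aligned-parquet"))
    val sqlContext = new SQLContext(sc)

    // Position i of every Row must hold the type declared at position i here.
    val schema = StructType(Array(
      StructField("GameId", LongType, true),
      StructField("AccountType", LongType, true),
      StructField("dtEventTime", StringType, true)))

    val rowRdd = sc.parallelize(Seq("1,2,2015-08-21", "3,4,2015-08-22"))
      .map(_.split(","))
      .map(a => Row(a(0).toLong, a(1).toLong, a(2)))   // Long, Long, String matches the schema

    val df = sqlContext.createDataFrame(rowRdd, schema)
    df.saveAsParquetFile("/tmp/schema_aligned.parquet")   // illustrative output path
    sc.stop()
  }
}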


ClassCastException when saving a DataFrame to parquet file (saveAsParquetFile, Spark 1.3.1) using Scala

2015-08-21 Thread Emma Boya Peng
Hi,

I was trying to programmatically specify a schema, apply it to an RDD of
Rows, and save the resulting DataFrame as a parquet file.
Here's what I did:

1. Created an RDD of Rows from RDD[Array[String]]:
val gameId= Long.valueOf(line(0))
  val accountType = Long.valueOf(line(1))
  val worldId = Long.valueOf(line(2))
  val dtEventTime = line(3)
  val iEventId = line(4)
  val vVersionId = line(5)
  val vUin = line(6)
  val vClientIp = line(7)
  val vZoneId = line(8)
  val dtCreateTime = line(9)
  val iFeeFlag = Long.valueOf(line(10))
  val vLoginWay = line(11)

  return Row(gameId, accountType, worldId, dtEventTime, iEventId,
vVersionId, vUin, vClientIp,
 vZoneId, dtCreateTime, vZoneId, dtCreateTime, iFeeFlag,
vLoginWay)


Re: intellij14 compiling spark-1.3.1 got error: assertion failed: com.google.protobuf.InvalidProtocalBufferException

2015-08-09 Thread longda...@163.com
the stack trace is below

Error:scalac: 
 while compiling:
/home/xiaoju/data/spark-1.3.1/core/src/main/scala/org/apache/spark/SparkContext.scala
during phase: typer
 library version: version 2.10.4
compiler version: version 2.10.4
  reconstructed args: -nobootcp -javabootclasspath :
-P:genjavadoc:out=/home/xiaoju/data/spark-1.3.1/core/target/java
-deprecation -feature -classpath
/usr/java/jdk1.7.0_71/jre/lib/resources.jar:/usr/java/jdk1.7.0_71/jre/lib/management-agent.jar:/usr/java/jdk1.7.0_71/jre/lib/charsets.jar:/usr/java/jdk1.7.0_71/jre/lib/rt.jar:/usr/java/jdk1.7.0_71/jre/lib/jsse.jar:/usr/java/jdk1.7.0_71/jre/lib/javaws.jar:/usr/java/jdk1.7.0_71/jre/lib/jfxrt.jar:/usr/java/jdk1.7.0_71/jre/lib/jfr.jar:/usr/java/jdk1.7.0_71/jre/lib/jce.jar:/usr/java/jdk1.7.0_71/jre/lib/deploy.jar:/usr/java/jdk1.7.0_71/jre/lib/plugin.jar:/usr/java/jdk1.7.0_71/jre/lib/ext/sunpkcs11.jar:/usr/java/jdk1.7.0_71/jre/lib/ext/dnsns.jar:/usr/java/jdk1.7.0_71/jre/lib/ext/localedata.jar:/usr/java/jdk1.7.0_71/jre/lib/ext/zipfs.jar:/usr/java/jdk1.7.0_71/jre/lib/ext/sunec.jar:/usr/java/jdk1.7.0_71/jre/lib/ext/sunjce_provider.jar:/home/xiaoju/data/spark-1.3.1/core/target/scala-2.10/classes:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/netty-all-4.0.23.Final.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/guava-14.0.1.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/bundles/guava-14.0.1.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/unused-1.0.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/chill_2.10-0.5.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/chill-java-0.5.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/kryo-2.21.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/bundles/kryo-2.21.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/reflectasm-1.07.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/reflectasm-1.07-shaded.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/minlog-1.2.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/objenesis-1.2.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/hadoop-client-1.0.4.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/hadoop-core-1.0.4.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/xmlenc-0.52.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/commons-codec-1.4.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/commons-math-2.1.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/commons-configuration-1.6.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/commons-collections-3.2.1.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/commons-lang-2.4.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/commons-logging-1.1.1.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/commons-digester-1.8.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/commons-beanutils-1.7.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/commons-beanutils-core-1.8.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/commons-el-1.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/hsqldb-1.8.0.10.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/oro-2.0.8.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/jackson-mapper-asl-1.0.1.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/jackson-core-asl-1.0.1.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/jets3t-0.7.1.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/commons-httpclient-3.1.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/curator-recipes-2.4.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/bundles/curator-recipes-2.4.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/curator-framework-2.4.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/bundles/curator-framework-2.4.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/curator-client-2.4.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/bundles/curator-client-2.4.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/zookeeper-3.4.5.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/jline-0.9.94.jar:/home/xiaoju/data/spark-1.3.1/lib_ma
naged/jars/jetty-plus-8.1.14.v20131031.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/javax.transaction-1.1.1.v201105210645.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/orbits/javax.transaction-1.1.1.v201105210645.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/jetty-webapp-8.1.14.v20131031.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/jetty-xml-8.1.14.v20131031.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/jetty-util-8.1.14.v20131031.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/jetty-servlet-8.1.14.v20131031.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/jetty-security-8.1.14.v20131031.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/jetty-server-8.1.14.v20131031.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/javax.servlet-3.0.0.v201112011016.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/orbits/javax.servlet-3.0.0.v201112011016.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/jetty-continuation-8.1.14.v20131031.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/jetty-http-8.1.14.v20131031

Re:Re: intellij14 compiling spark-1.3.1 got error: assertion failed: com.google.protobuf.InvalidProtocalBufferException

2015-08-09 Thread 龙淡
Thank you for the reply.
   I use sbt to compile Spark, but there are both protobuf 2.4.1 and 2.5.0 in the 
Maven repository, and protobuf 2.5.0 in the .ivy repository.
   The stack trace is below:
   Error:scalac:
 while compiling: 
/home/xiaoju/data/spark-1.3.1/core/src/main/scala/org/apache/spark/SparkContext.scala
during phase: typer
 library version: version 2.10.4
compiler version: version 2.10.4
  reconstructed args: -nobootcp -javabootclasspath : 
-P:genjavadoc:out=/home/xiaoju/data/spark-1.3.1/core/target/java -deprecation 
-feature -classpath 
/usr/java/jdk1.7.0_71/jre/lib/resources.jar:/usr/java/jdk1.7.0_71/jre/lib/management-agent.jar:/usr/java/jdk1.7.0_71/jre/lib/charsets.jar:/usr/java/jdk1.7.0_71/jre/lib/rt.jar:/usr/java/jdk1.7.0_71/jre/lib/jsse.jar:/usr/java/jdk1.7.0_71/jre/lib/javaws.jar:/usr/java/jdk1.7.0_71/jre/lib/jfxrt.jar:/usr/java/jdk1.7.0_71/jre/lib/jfr.jar:/usr/java/jdk1.7.0_71/jre/lib/jce.jar:/usr/java/jdk1.7.0_71/jre/lib/deploy.jar:/usr/java/jdk1.7.0_71/jre/lib/plugin.jar:/usr/java/jdk1.7.0_71/jre/lib/ext/sunpkcs11.jar:/usr/java/jdk1.7.0_71/jre/lib/ext/dnsns.jar:/usr/java/jdk1.7.0_71/jre/lib/ext/localedata.jar:/usr/java/jdk1.7.0_71/jre/lib/ext/zipfs.jar:/usr/java/jdk1.7.0_71/jre/lib/ext/sunec.jar:/usr/java/jdk1.7.0_71/jre/lib/ext/sunjce_provider.jar:/home/xiaoju/data/spark-1.3.1/core/target/scala-2.10/classes:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/netty-all-4.0.23.Final.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/guava-14.0.1.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/bundles/guava-14.0.1.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/unused-1.0.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/chill_2.10-0.5.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/chill-java-0.5.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/kryo-2.21.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/bundles/kryo-2.21.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/reflectasm-1.07.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/reflectasm-1.07-shaded.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/minlog-1.2.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/objenesis-1.2.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/hadoop-client-1.0.4.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/hadoop-core-1.0.4.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/xmlenc-0.52.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/commons-codec-1.4.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/commons-math-2.1.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/commons-configuration-1.6.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/commons-collections-3.2.1.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/commons-lang-2.4.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/commons-logging-1.1.1.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/commons-digester-1.8.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/commons-beanutils-1.7.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/commons-beanutils-core-1.8.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/commons-el-1.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/hsqldb-1.8.0.10.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/oro-2.0.8.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/jackson-mapper-asl-1.0.1.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/jackson-core-asl-1.0.1.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/jets3t-0.7.1.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/commons-httpclient-3.1.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/curator-recipes-2.4.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/bundles/curator-recipes-2.4.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/curator-framework-2.4.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/bundles/curator-framework-2.4.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/curator-client-2.4.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/bundles/curator-client-2.4.0.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/zookeeper-3.4.5.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/jline-0.9.94.jar:/home/xiaoju/data/spark-1.3.1/lib_ma
naged/jars/jetty-plus-8.1.14.v20131031.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/javax.transaction-1.1.1.v201105210645.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/orbits/javax.transaction-1.1.1.v201105210645.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/jetty-webapp-8.1.14.v20131031.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/jetty-xml-8.1.14.v20131031.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/jetty-util-8.1.14.v20131031.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/jetty-servlet-8.1.14.v20131031.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/jetty-security-8.1.14.v20131031.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/jetty-server-8.1.14.v20131031.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/jars/javax.servlet-3.0.0.v201112011016.jar:/home/xiaoju/data/spark-1.3.1/lib_managed/orbits/javax.servlet-3.0.0.v201112011016

Re: intellij14 compiling spark-1.3.1 got error: assertion failed: com.google.protobuf.InvalidProtocalBufferException

2015-08-09 Thread Ted Yu
Can you check if there is a protobuf version other than 2.5.0 on the
classpath?
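
For example, with a Maven build something like the following lists every protobuf
artifact on the classpath (a sketch; adjust to whichever build tool you use):

    mvn dependency:tree -Dincludes=com.google.protobuf:protobuf-java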

Please show the complete stack trace.

Cheers

On Sun, Aug 9, 2015 at 9:41 AM, longda...@163.com  wrote:

> hi all,
> i compile spark-1.3.1 on linux use intellij14 and got error assertion
> failed: com.google.protobuf.InvalidProtocalBufferException, how could i
> solve the problem?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/intellij14-compiling-spark-1-3-1-got-error-assertion-failed-com-google-protobuf-InvalidProtocalBuffen-tp24186.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


intellij14 compiling spark-1.3.1 got error: assertion failed: com.google.protobuf.InvalidProtocalBufferException

2015-08-09 Thread longda...@163.com
Hi all,
I compile spark-1.3.1 on Linux using IntelliJ 14 and get the error "assertion
failed: com.google.protobuf.InvalidProtocolBufferException". How can I
solve the problem?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/intellij14-compiling-spark-1-3-1-got-error-assertion-failed-com-google-protobuf-InvalidProtocalBuffen-tp24186.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

2015-07-22 Thread Eugene Morozov
Hi, 

I’m stuck with the same issue, but I do see 
org.apache.hadoop.fs.s3native.NativeS3FileSystem in hadoop-core:1.0.4 
(that’s the current hadoop-client I use), and so far that is a transitive dependency 
that comes from Spark itself. I’m using a custom build of Spark 1.3.1 with 
hadoop-client 1.0.4. 

[INFO] +- 
org.apache.spark:spark-core_2.10:jar:1.3.1-hadoop-client-1.0.4:provided
...
[INFO] |  +- org.apache.hadoop:hadoop-client:jar:1.0.4:provided
[INFO] |  |  \- org.apache.hadoop:hadoop-core:jar:1.0.4:provided

The thing is, I don’t have any direct usage of any hadoop-client version, so in 
my understanding I should be able to run my jar on any version of Spark (1.3.1 
with hadoop-client 2.2.0 up to 2.2.6, or 1.3.1 with hadoop-client 1.0.4 up to 
1.2.1), but in reality, running it on a live cluster, I’m getting a class-not-found 
exception. I’ve checked the über-jar of Spark itself, and NativeS3FileSystem 
is there, so I don’t really understand where the problem comes from.

java.lang.RuntimeException: java.lang.ClassNotFoundException: Class 
org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)


I’ve just got an idea. Is it possible that Executor’s classpath is different 
from the Worker classpath? How can I check Executor’s classpath?
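
For what it’s worth, one way to dump the JVM classpath seen by each executor is a
small job like the following sketch (runnable from the shell; note that jars shipped
with --jars are added through the executor’s own classloader and will not show up in
java.class.path):

    import java.lang.management.ManagementFactory

    // Run enough tasks that every executor gets at least one, then print the
    // distinct JVM classpaths reported back to the driver.
    sc.parallelize(1 to 100, 100)
      .map(_ => ManagementFactory.getRuntimeMXBean.getClassPath)
      .distinct()
      .collect()
      .foreach(println)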

On 23 Apr 2015, at 17:35, Ted Yu  wrote:

> NativeS3FileSystem class is in hadoop-aws jar.
> Looks like it was not on classpath.
> 
> Cheers
> 
> On Thu, Apr 23, 2015 at 7:30 AM, Sujee Maniyam  wrote:
> Thanks all...
> 
> btw, s3n load works without any issues with  spark-1.3.1-bulit-for-hadoop 2.4 
> 
> I tried this on 1.3.1-hadoop26
> >  sc.hadoopConfiguration.set("fs.s3n.impl", 
> > "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
> > val f = sc.textFile("s3n://bucket/file")
> > f.count
> 
> No it can't find the implementation path.  Looks like some jar is missing ?
> 
> java.lang.RuntimeException: java.lang.ClassNotFoundException: Class 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
>   at 
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
> 
> On Wednesday, April 22, 2015, Shuai Zheng  wrote:
> Below is my code to access s3n without problem (only for 1.3.1. there is a 
> bug in 1.3.0).
> 
>  
> 
>   Configuration hadoopConf = ctx.hadoopConfiguration();
> 
>   hadoopConf.set("fs.s3n.impl", 
> "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
> 
>   hadoopConf.set("fs.s3n.awsAccessKeyId", awsAccessKeyId);
> 
>   hadoopConf.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey);
> 
>  
> 
> Regards,
> 
>  
> 
> Shuai
> 
>  
> 
> From: Sujee Maniyam [mailto:su...@sujee.net] 
> Sent: Wednesday, April 22, 2015 12:45 PM
> To: Spark User List
> Subject: spark 1.3.1 : unable to access s3n:// urls (no file system for 
> scheme s3n:)
> 
>  
> 
> Hi all
> 
> I am unable to access s3n://  urls using   sc.textFile().. getting 'no file 
> system for scheme s3n://'  error.
> 
>  
> 
> a bug or some conf settings missing?
> 
>  
> 
> See below for details:
> 
>  
> 
> env variables : 
> 
> AWS_SECRET_ACCESS_KEY=set
> 
> AWS_ACCESS_KEY_ID=set
> 
>  
> 
> spark/RELAESE :
> 
> Spark 1.3.1 (git revision 908a0bf) built for Hadoop 2.6.0
> 
> Build flags: -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver 
> -Pyarn -DzincPort=3034
> 
>  
> 
>  
> 
> ./bin/spark-shell
> 
> > val f = sc.textFile("s3n://bucket/file")
> 
> > f.count
> 
>  
> 
> error==> 
> 
> java.io.IOException: No FileSystem for scheme: s3n
> 
> at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
> 
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
> 
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
> 
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
> 
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
> 
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
> 
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
> 
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(Fi

databricks spark sql csv FAILFAST not failing, Spark 1.3.1 Java 7

2015-07-22 Thread Adam Pritchard
Hi all,

I am using the databricks csv library to load some data into a data frame.
https://github.com/databricks/spark-csv


I am trying to confirm that FAILFAST mode works correctly and aborts
execution upon receiving an invalid CSV file, but I have not been able to
see it fail yet after testing numerous invalid CSV files. Any advice?

spark 1.3.1 running on mapr vm 4.1.0 java 1.7


SparkConf conf = new SparkConf().setAppName("Dataframe testing");

JavaSparkContext sc = new JavaSparkContext(conf);


SQLContext sqlContext = new SQLContext(sc);
HashMap<String, String> options = new HashMap<String, String>();
options.put("header", "true");
options.put("path", args[0]);
options.put("mode", "FAILFAST");
//partner data
DataFrame partnerData = sqlContext.load("com.databricks.spark.csv", options
);
//register partnerData table in spark sql
partnerData.registerTempTable("partnerData");

partnerData.printSchema();
partnerData.show();


It just runs like normal, and will output the data, even with an invalid
csv file.
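
(A variant sometimes suggested, sketched here in Scala and not verified against this
spark-csv version: supply an explicit schema instead of relying on the header, and
force an action, since the full parse, and hence any FAILFAST error, only happens
when the data is actually read.)

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.types.{StructField, StringType, StructType}

    val sqlContext = new SQLContext(sc)

    // Hypothetical two-column schema; replace with the real column list.
    val schema = StructType(Seq(
      StructField("id", StringType, nullable = false),
      StructField("name", StringType, nullable = false)))

    val options = Map(
      "header" -> "true",
      "path"   -> "/path/to/partner.csv",  // placeholder path
      "mode"   -> "FAILFAST")

    val partnerData = sqlContext.load("com.databricks.spark.csv", schema, options)
    partnerData.count()  // the parse (and any FAILFAST failure) happens here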


Thanks!


Re: Spark 1.3.1 + Hive: write output to CSV with header on S3

2015-07-17 Thread Michael Armbrust
Using a hive-site.xml file on the classpath.
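
For example, a minimal hive-site.xml pointing HiveContext at a remote metastore might
look like this (the host below is a placeholder; hive.metastore.uris is the standard
property and 9083 the usual metastore port):

    <configuration>
      <property>
        <name>hive.metastore.uris</name>
        <value>thrift://your-metastore-host:9083</value>
      </property>
    </configuration>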

On Fri, Jul 17, 2015 at 8:37 AM, spark user 
wrote:

> Hi Roberto
>
> I have question regarding HiveContext .
>
> when you create HiveContext where you define Hive connection properties ?
> Suppose Hive is not in local machine i need to connect , how HiveConext
> will know the data base info like url ,username and password ?
>
> String  username = "";
> String  password = "";
>
> String url = "jdbc:hive2://quickstart.cloudera:1/default";
>
>
>
>   On Friday, July 17, 2015 2:29 AM, Roberto Coluccio <
> roberto.coluc...@gmail.com> wrote:
>
>
> Hello community,
>
> I'm currently using Spark 1.3.1 with Hive support for outputting processed
> data on an external Hive table backed on S3. I'm using a manual
> specification of the delimiter, but I'd want to know if is there any
> "clean" way to write in CSV format:
>
> *val* sparkConf = *new* SparkConf()
> *val* sc = *new* SparkContext(sparkConf)
> *val* hiveContext = *new* org.apache.spark.sql.hive.HiveContext(sc)
> *import* hiveContext.implicits._
> hiveContext.sql( "CREATE EXTERNAL TABLE IF NOT EXISTS table_name(field1
> STRING, field2 STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> LOCATION '" + path_on_s3 + "'")
> hiveContext.sql()
>
> I also need the header of the table to be printed on each written file. I
> tried with:
>
> hiveContext.sql("set hive.cli.print.header=true")
>
> But it didn't work.
>
> Any hint?
>
> Thank you.
>
> Best regards,
> Roberto
>
>
>
>


Re: Spark 1.3.1 + Hive: write output to CSV with header on S3

2015-07-17 Thread spark user
Hi Roberto 
I have question regarding HiveContext . 
when you create HiveContext where you define Hive connection properties ?  
Suppose Hive is not in local machine i need to connect , how HiveConext will 
know the data base info like url ,username and password ?
String  username = "";
String  password = "";
String url = "jdbc:hive2://quickstart.cloudera:1/default";  


 On Friday, July 17, 2015 2:29 AM, Roberto Coluccio 
 wrote:
   

 Hello community,
I'm currently using Spark 1.3.1 with Hive support for outputting processed data 
on an external Hive table backed by S3. I'm using a manual specification of the 
delimiter, but I'd like to know whether there is any "clean" way to write in CSV 
format:

val sparkConf = new SparkConf()
val sc = new SparkContext(sparkConf)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.implicits._
hiveContext.sql( "CREATE EXTERNAL TABLE IF NOT EXISTS table_name(field1 STRING, field2 STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '" + path_on_s3 + "'")
hiveContext.sql()

I also need the header of the table to be printed on each written file. I tried 
with:

hiveContext.sql("set hive.cli.print.header=true")

But it didn't work.
Any hint?
Thank you.
Best regards,
Roberto


  

Spark 1.3.1 + Hive: write output to CSV with header on S3

2015-07-17 Thread Roberto Coluccio
Hello community,

I'm currently using Spark 1.3.1 with Hive support for outputting processed
data on an external Hive table backed by S3. I'm using a manual
specification of the delimiter, but I'd like to know whether there is any
"clean" way to write in CSV format:

val sparkConf = new SparkConf()

val sc = new SparkContext(sparkConf)

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

import hiveContext.implicits._

hiveContext.sql( "CREATE EXTERNAL TABLE IF NOT EXISTS table_name(field1
STRING, field2 STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '" + path_on_s3 + "'")

hiveContext.sql()


I also need the header of the table to be printed on each written file. I
tried with:


hiveContext.sql("set hive.cli.print.header=true")


But it didn't work.


Any hint?


Thank you.


Best regards,

Roberto


[Spark 1.3.1] Spark HiveQL -> CDH 5.3 Hive 0.13 UDF's

2015-06-26 Thread Mike Frampton
Hi 

I have a five node CDH 5.3 cluster running on CentOS 6.5, I also have a 
separate 
install of Spark 1.3.1. ( The CDH 5.3 install has Spark 1.2 but I wanted a 
newer version. )

I managed to write some Scala based code using a Hive Context to connect to 
Hive and 
create/populate  tables etc. I compiled my application using sbt and ran it 
with spark-submit
in local mode. 

My question concerns UDFs, specifically the row_sequence function in 
the hive-contrib 
jar file, i.e.  

hiveContext.sql("""

ADD JAR 
/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/jars/hive-contrib-0.13.1-cdh5.3.3.jar

  """)

hiveContext.sql("""

CREATE TEMPORARY FUNCTION row_sequence as 
'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';

  """)


 val resRDD = hiveContext.sql("""

  SELECT row_sequence(),t1.edu FROM
( SELECT DISTINCT education AS edu FROM adult3 ) t1
  ORDER BY t1.edu

""")

This seems to generate its sequence in the map (?) phase of execution because 
no matter how I fiddle 
with the main SQL I could not get an ascending index for dimension data. i.e. I 
always get 

1  val1
1  val2
1  val3

instead of 

1  val1
2  val2
3  val3

I'm well aware that I can work around this in Scala, and I have, but I wondered 
whether others have come across this and solved it?
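
(For reference, the Scala workaround alluded to above could look roughly like the
following sketch, which reuses the adult3 table and education column from the query:)

    val eduWithIndex = hiveContext.sql(
        "SELECT DISTINCT education AS edu FROM adult3 ORDER BY edu")
      .rdd
      .zipWithIndex()                     // index follows the sorted order
      .map { case (row, idx) => (idx + 1, row.getString(0)) }

    eduWithIndex.collect().foreach { case (i, edu) => println(s"$i  $edu") }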

cheers

Mike F

[Spark 1.3.1 SQL] Using Hive

2015-06-21 Thread Mike Frampton
Hi 

Is it true that if I want to use Spark SQL (for Spark 1.3.1) against Apache 
Hive I need to build Spark from source? 

I'm using CDH 5.3 on CentOS Linux 6.5, which uses Hive 0.13.0 (I think). 
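
(If building from source, Hive support comes from the -Phive/-Phive-thriftserver
profiles, the same flags that appear in the RELEASE file quoted elsewhere in this
archive. A sketch of such a build; the hadoop.version here is only a guess for CDH 5.3
and should be adjusted to the cluster:)

    ./make-distribution.sh --tgz -Pyarn -Phadoop-2.4 \
      -Dhadoop.version=2.5.0-cdh5.3.0 -Phive -Phive-thriftserver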

cheers

Mike F

  

Re: [Spark 1.3.1 on YARN on EMR] Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-06-20 Thread Roberto Coluccio
I confirm,

Christopher was very kind helping me out here. The solution presented in the 
linked doc worked perfectly. IMO it should be linked in the official Spark 
documentation.

Thanks again,

Roberto


> On 20 Jun 2015, at 19:25, Bozeman, Christopher  wrote:
> 
> We worked it out.  There was multiple items (like location of remote 
> metastore and db user auth) to make HiveContext happy in yarn-cluster mode. 
> 
> For reference 
> https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/examples/using-hivecontext-yarn-cluster.md
> 
> -Christopher Bozeman 
> 
> 
> On Jun 20, 2015, at 7:24 AM, Andrew Lee  wrote:
> 
>> Hi Roberto,
>> 
>> I'm not an EMR person, but it looks like option -h is deploying the 
>> necessary dataneucleus JARs for you.
>> The req for HiveContext is the hive-site.xml and dataneucleus JARs. As long 
>> as these 2 are there, and Spark is compiled with -Phive, it should work.
>> 
>> spark-shell runs in yarn-client mode. Not sure whether your other 
>> application is running under the same mode or a different one. Try 
>> specifying yarn-client mode and see if you get the same result as 
>> spark-shell.
>> 
>> From: roberto.coluc...@gmail.com
>> Date: Wed, 10 Jun 2015 14:32:04 +0200
>> Subject: [Spark 1.3.1 on YARN on EMR] Unable to instantiate 
>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient
>> To: user@spark.apache.org
>> 
>> Hi!
>> 
>> I'm struggling with an issue with Spark 1.3.1 running on YARN, running on an 
>> AWS EMR cluster. Such cluster is based on AMI 3.7.0 (hence Amazon Linux 
>> 2015.03, Hive 0.13 already installed and configured on the cluster, Hadoop 
>> 2.4, etc...). I make use of the AWS emr-bootstrap-action "install-spark" 
>> (https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark) with 
>> the option/version "-v1.3.1e" so to get the latest Spark for EMR installed 
>> and available.
>> 
>> I also have a simple Spark Streaming driver in my project. Such driver is 
>> part of a larger Maven project: in the pom.xml I'm currently using   
>> 
>> [...]
>> 
>> 
>> 
>> 2.10
>> 
>> 2.10.4
>> 
>> 1.7
>> 
>> 1.3.1
>> 
>> 2.4.1
>> 
>> 
>> 
>> []
>> 
>> 
>> 
>> 
>>   org.apache.spark
>> 
>>   spark-streaming_${scala.binary.version}
>> 
>>   ${spark.version}
>> 
>>   provided
>> 
>>   
>> 
>> 
>> 
>>   org.apache.hadoop
>> 
>>   hadoop-client
>> 
>> 
>> 
>>   
>> 
>> 
>> 
>> 
>> 
>> 
>>   org.apache.hadoop
>> 
>>   hadoop-client
>> 
>>   ${hadoop.version}
>> 
>>   provided
>> 
>> 
>> 
>> 
>> 
>> 
>>   org.apache.spark
>> 
>>   spark-hive_${scala.binary.version}
>> 
>>   ${spark.version}
>> 
>>   provided
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> In fact, at compile and build time everything works just fine if, in my 
>> driver, I have:
>> 
>> 
>> 
>> -
>> 
>> 
>> 
>> val sparkConf = new SparkConf()
>> 
>>   .setAppName(appName)
>> 
>>   .set("spark.local.dir", "/tmp/" + appName)
>> 
>>   .set("spark.streaming.unpersist", "true")
>> 
>>   .set("spark.serializer", 
>> "org.apache.spark.serializer.KryoSerializer")
>> 
>>   .registerKryoClasses(Array(classOf[java.net.URI], classOf[String]))
>> 
>> 
>> val sc = new SparkContext(sparkConf)
>> 
>> 
>> val ssc = new StreamingContext(sc, config.batchDuration)
>> import org.apache.spark.streaming.StreamingContext._
>> 
>> ssc.checkpoint(sparkConf.get("spark.local.dir") + checkpointRelativeDir)
>> 
>> 
>> < some input reading actions >
>> 
>> 
>> < some input transformation actions >
>> 
>> 
>> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> 
>> import sqlContext.implicits._
>> 
>> sqlContext.sql()
>> 
>> 
>> ssc.start()
>> 
>> ssc.awaitTerminationOrTimeout(config.timeout)
>> 
>> 
>> 
>> --- 
>> 
>> 
>> 
>>

Re: [Spark 1.3.1 on YARN on EMR] Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-06-20 Thread Bozeman, Christopher
We worked it out.  There were multiple items (like the location of the remote metastore 
and DB user auth) to sort out to make HiveContext happy in yarn-cluster mode.

For reference 
https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/examples/using-hivecontext-yarn-cluster.md

-Christopher Bozeman


On Jun 20, 2015, at 7:24 AM, Andrew Lee  wrote:

Hi Roberto,

I'm not an EMR person, but it looks like option -h is deploying the necessary 
dataneucleus JARs for you.
The req for HiveContext is the hive-site.xml and dataneucleus JARs. As long as 
these 2 are there, and Spark is compiled with -Phive, it should work.

spark-shell runs in yarn-client mode. Not sure whether your other application 
is running under the same mode or a different one. Try specifying yarn-client 
mode and see if you get the same result as spark-shell.


From: roberto.coluc...@gmail.com
Date: Wed, 10 Jun 2015 14:32:04 +0200
Subject: [Spark 1.3.1 on YARN on EMR] Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
To: user@spark.apache.org

Hi!

I'm struggling with an issue with Spark 1.3.1 running on YARN, running on an 
AWS EMR cluster. Such cluster is based on AMI 3.7.0 (hence Amazon Linux 
2015.03, Hive 0.13 already installed and configured on the cluster, Hadoop 2.4, 
etc...). I make use of the AWS emr-bootstrap-action "install-spark" 
(https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark) with the 
option/version "-v1.3.1e" so to get the latest Spark for EMR installed and 
available.

I also have a simple Spark Streaming driver in my project. Such driver is part 
of a larger Maven project: in the pom.xml I'm currently using


[...]


2.10

2.10.4

1.7

1.3.1

2.4.1


[]




  org.apache.spark

  spark-streaming_${scala.binary.version}

  ${spark.version}

  provided

  



  org.apache.hadoop

  hadoop-client



  






  org.apache.hadoop

  hadoop-client

  ${hadoop.version}

  provided






  org.apache.spark

  spark-hive_${scala.binary.version}

  ${spark.version}

  provided





In fact, at compile and build time everything works just fine if, in my driver, 
I have:


-


val sparkConf = new SparkConf()

  .setAppName(appName)

  .set("spark.local.dir", "/tmp/" + appName)

  .set("spark.streaming.unpersist", "true")

  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

  .registerKryoClasses(Array(classOf[java.net.URI], classOf[String]))


val sc = new SparkContext(sparkConf)


val ssc = new StreamingContext(sc, config.batchDuration)

import org.apache.spark.streaming.StreamingContext._

ssc.checkpoint(sparkConf.get("spark.local.dir") + checkpointRelativeDir)


< some input reading actions >


< some input transformation actions >


val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

import sqlContext.implicits._

sqlContext.sql()


ssc.start()

ssc.awaitTerminationOrTimeout(config.timeout)



---


What happens is that, right after have been launched, the driver fails with the 
exception:


15/06/10 11:38:18 ERROR yarn.ApplicationMaster: User class threw exception: 
java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
at 
org.apache.spark.sql.hive.HiveContext.sessionState$lzycompute(HiveContext.scala:239)
at org.apache.spark.sql.hive.HiveContext.sessionState(HiveContext.scala:235)
at 
org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:251)
at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:250)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:95)
at  myDriver.scala: < line of the sqlContext.sql(query) >
Caused by < some stuff >
Caused by: javax.jdo.JDOFatalUserException: Class 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
NestedThrowables:
java.lang.ClassNotFoundException: 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory
...
Caused by: java.lang.ClassNotFoundException: 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory


Thinking about a wrong Hive installation/configuration or libs/classpath 
definition, I SSHed into the cluster and launched a spark-shell. Excluding the 
app configuration and StreamingContext usage/definition, I then carried out all 
the actions listed in the driver implementation, in particul

RE: [Spark 1.3.1 on YARN on EMR] Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-06-20 Thread Andrew Lee
Hi Roberto,
I'm not an EMR person, but it looks like option -h is deploying the necessary 
datanucleus JARs for you. The requirements for HiveContext are the hive-site.xml and 
the datanucleus JARs. As long as these two are there, and Spark is compiled with 
-Phive, it should work.
spark-shell runs in yarn-client mode. Not sure whether your other application 
is running under the same mode or a different one. Try specifying yarn-client 
mode and see if you get the same result as spark-shell.
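
(For yarn-cluster mode the submit command usually has to ship both pieces explicitly.
A sketch only: the hive-site.xml path, the datanucleus jar versions and the driver
class/application jar below are placeholders based on the Spark 1.3.1 EMR layout
mentioned later in this thread, not verified values.)

    spark-submit --master yarn-cluster \
      --files /home/hadoop/spark/conf/hive-site.xml \
      --jars /home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-core-3.2.10.jar,/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-api-jdo-3.2.6.jar,/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-rdbms-3.2.9.jar \
      --class com.example.StreamingDriver my-streaming-app.jar
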
From: roberto.coluc...@gmail.com
Date: Wed, 10 Jun 2015 14:32:04 +0200
Subject: [Spark 1.3.1 on YARN on EMR] Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
To: user@spark.apache.org

Hi!
I'm struggling with an issue with Spark 1.3.1 running on YARN, running on an 
AWS EMR cluster. Such cluster is based on AMI 3.7.0 (hence Amazon Linux 
2015.03, Hive 0.13 already installed and configured on the cluster, Hadoop 2.4, 
etc...). I make use of the AWS emr-bootstrap-action "install-spark" 
(https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark) with the 
option/version "-v1.3.1e" so to get the latest Spark for EMR installed and 
available.
I also have a simple Spark Streaming driver in my project. Such driver is part 
of a larger Maven project: in the pom.xml I'm currently using   
[...]
<properties>
  <scala.binary.version>2.10</scala.binary.version>
  <scala.version>2.10.4</scala.version>
  <java.version>1.7</java.version>
  <spark.version>1.3.1</spark.version>
  <hadoop.version>2.4.1</hadoop.version>
</properties>

[...]

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
    <exclusions>
      <exclusion>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
      </exclusion>
    </exclusions>
  </dependency>

  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
    <scope>provided</scope>
  </dependency>

  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
</dependencies>


In fact, at compile and build time everything works just fine if, in my driver, 
I have:
-
val sparkConf = new SparkConf()
  .setAppName(appName)
  .set("spark.local.dir", "/tmp/" + appName)
  .set("spark.streaming.unpersist", "true")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[java.net.URI], classOf[String]))

val sc = new SparkContext(sparkConf)

val ssc = new StreamingContext(sc, config.batchDuration)
import org.apache.spark.streaming.StreamingContext._

ssc.checkpoint(sparkConf.get("spark.local.dir") + checkpointRelativeDir)

< some input reading actions >

< some input transformation actions >

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
sqlContext.sql()

ssc.start()
ssc.awaitTerminationOrTimeout(config.timeout)

--- 
What happens is that, right after have been launched, the driver fails with the 
exception:
15/06/10 11:38:18 ERROR yarn.ApplicationMaster: User class threw exception: 
java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
at 
org.apache.spark.sql.hive.HiveContext.sessionState$lzycompute(HiveContext.scala:239)
at org.apache.spark.sql.hive.HiveContext.sessionState(HiveContext.scala:235)
at 
org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:251)
at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:250)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:95)
at  myDriver.scala: < line of the sqlContext.sql(query) >
Caused by < some stuff >
Caused by: javax.jdo.JDOFatalUserException: Class 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
NestedThrowables:
java.lang.ClassNotFoundException: 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory
...
Caused by: java.lang.ClassNotFoundException: 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory
Thinking about a wrong Hive installation/configuration or libs/classpath 
definition, I SSHed into the cluster and launched a spark-shell. Excluding the 
app configuration and StreamingContext usage/definition, I then carried out all 
the actions listed in the driver implementation, in particular all the 
Hive-related ones and they all went through smoothly!

I also tried to use the optional "-h" argument 
(https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/README.md#arguments-optional)
 in the install-spark emr-bootstrap-action, but the driver failed the very same 
way. Furthermore, when launching a spark-shell (on the EMR cluster with Spark 
installed with the "-h" option), I also got:
15/06/09 14:20:51 WARN conf.HiveConf: hive-default.xml not found on CLASSPATH
15/06/09 14:20:52 INFO metastore.Hi

Spark SQL DATE_ADD function - Spark 1.3.1 & 1.4.0

2015-06-17 Thread Nathan McCarthy
Hi guys,

Running with a parquet backed table in hive ‘dim_promo_date_curr_p' which has 
the following data;

scala> sqlContext.sql("select * from pz.dim_promo_date_curr_p").show(3)
15/06/18 00:53:21 INFO ParseDriver: Parsing command: select * from 
pz.dim_promo_date_curr_p
15/06/18 00:53:21 INFO ParseDriver: Parse Completed
+--+-+---+
|clndr_date|pw_start_date|pw_end_date|
+--+-+---+
|2015-02-18|   2015-02-18| 2015-02-24|
|2015-11-13|   2015-11-11| 2015-11-17|
|2015-03-31|   2015-03-25| 2015-03-31|
|2015-07-21|   2015-07-15| 2015-07-21|
+--+-+---+

Running a DATE_ADD query from the Spark 1.4 shell with the (Hive) sqlContext, it 
seems to work except for the value read from the table. I've only seen this on the 
31st of March, on no other dates;

scala> sqlContext.sql("SELECT DATE_ADD(CLNDR_DATE, 7) as wrong, 
DATE_ADD('2015-03-30', 7) as right30, DATE_ADD('2015-03-31', 7) as right31, 
DATE_ADD('2015-04-01', 7) as right01 FROM pz.dim_promo_date_curr_p WHERE 
CLNDR_DATE='2015-03-31'").show

15/06/18 00:57:32 INFO ParseDriver: Parsing command: SELECT 
DATE_ADD(CLNDR_DATE, 7) as wrong, DATE_ADD('2015-03-30', 7) as right30, 
DATE_ADD('2015-03-31', 7) as right31, DATE_ADD('2015-04-01', 7) as right01 FROM 
pz.dim_promo_date_curr_p WHERE CLNDR_DATE='2015-03-31'
15/06/18 00:57:32 INFO ParseDriver: Parse Completed
+--+--+--+--+
| wrong|   right30|   right31|   right01|
+--+--+--+--+
|2015-04-06|2015-04-06|2015-04-07|2015-04-08|
+--+--+--+--+

It seems to drop a day, even though the WHERE clause selects the 31st. When the 
date is just a string literal, the SELECT clause works fine. The problem appears in 
Spark 1.3.1 as well.

Not sure if this is coming from Hive, but it seems like a bug. I’ve raised a 
JIRA https://issues.apache.org/jira/browse/SPARK-8421

Cheers,
Nathan




Re: Not getting event logs >= spark 1.3.1

2015-06-16 Thread Tsai Li Ming
Forgot to mention this is on standalone mode.

Is my configuration wrong?

Thanks,
Liming

On 15 Jun, 2015, at 11:26 pm, Tsai Li Ming  wrote:

> Hi,
> 
> I have this in my spark-defaults.conf (same for hdfs):
> spark.eventLog.enabled  true
> spark.eventLog.dir  file:/tmp/spark-events
> spark.history.fs.logDirectory   file:/tmp/spark-events
> 
> While the app is running, there is a “.inprogress” directory. However when 
> the job completes, the directory is always empty.
> 
> I’m submitting the job like this, using either the Pi or world count examples:
> $ bin/spark-submit 
> /opt/spark-1.4.0-bin-hadoop2.6/examples/src/main/python/wordcount.py 
> 
> This used to be working in 1.2.1 and didn’t test 1.3.0.
> 
> 
> Regards,
> Liming
> 
> 
> 
> 
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Not getting event logs >= spark 1.3.1

2015-06-15 Thread Tsai Li Ming
Hi,

I have this in my spark-defaults.conf (same for hdfs):
spark.eventLog.enabled  true
spark.eventLog.dir  file:/tmp/spark-events
spark.history.fs.logDirectory   file:/tmp/spark-events

While the app is running, there is a “.inprogress” directory. However when the 
job completes, the directory is always empty.

I’m submitting the job like this, using either the Pi or word count examples:
$ bin/spark-submit 
/opt/spark-1.4.0-bin-hadoop2.6/examples/src/main/python/wordcount.py 

This used to be working in 1.2.1 and didn’t test 1.3.0.


Regards,
Liming






-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-11 Thread Josh Mahonin
Hi Jeroen,

No problem. I think there's some magic involved with how the Spark
classloader(s) works, especially with regards to the HBase dependencies. I
know there's probably a more light-weight solution that doesn't require
customizing the Spark setup, but that's the most straight-forward way I've
found that works.

Looking again at the docs, I thought I had a PR that mentioned the
SPARK_CLASSPATH, but either I'm dreaming it or it got dropped on the floor.
I'll search around for it today.

Thanks for the StackOverflow heads up, but feel free to update your post
with the resolution, maybe with a GMane link to the thread?

Good luck,

Josh

On Thu, Jun 11, 2015 at 2:38 AM, Jeroen Vlek  wrote:

> Hi Josh,
>
> That worked! Thank you so much! (I can't believe it was something so
> obvious
> ;) )
>
> If you care about such a thing you could answer my question here for
> bounty:
>
> http://stackoverflow.com/questions/30639659/apache-phoenix-4-3-1-and-4-4-0-hbase-0-98-on-spark-1-3-1-classnotfoundexceptio
>
> Have a great day!
>
> Cheers,
> Jeroen
>
> On Wednesday 10 June 2015 08:58:02 Josh Mahonin wrote:
> > Hi Jeroen,
> >
> > Rather than bundle the Phoenix client JAR with your app, are you able to
> > include it in a static location either in the SPARK_CLASSPATH, or set the
> > conf values below (I use SPARK_CLASSPATH myself, though it's deprecated):
> >
> >   spark.driver.extraClassPath
> >   spark.executor.extraClassPath
> >
> > Josh
> >
> > On Wed, Jun 10, 2015 at 4:11 AM, Jeroen Vlek 
> wrote:
> > > Hi Josh,
> > >
> > > Thank you for your effort. Looking at your code, I feel that mine is
> > > semantically the same, except written in Java. The dependencies in the
> > > pom.xml
> > > all have the scope provided. The job is submitted as follows:
> > >
> > > $ rm spark.log && MASTER=spark://maprdemo:7077
> > > /opt/mapr/spark/spark-1.3.1/bin/spark-submit-jars
> > > /home/mapr/projects/customer/lib/spark-streaming-
> > >
> > >
> kafka_2.10-1.3.1.jar,/home/mapr/projects/customer/lib/kafka_2.10-0.8.1.1.j
> > >
> ar,/home/mapr/projects/customer/lib/zkclient-0.3.jar,/home/mapr/projects/c
> > > ustomer/lib/metrics-
> > > core-3.1.0.jar,/home/mapr/projects/customer/lib/metrics-
> > >
> > >
> core-2.2.0.jar,lib/spark-sql_2.10-1.3.1.jar,/opt/mapr/phoenix/phoenix-4.4.
> > > 0- HBase-0.98-bin/phoenix-4.4.0-HBase-0.98-client.jar --class
> > > nl.work.kafkastreamconsumer.phoenix.KafkaPhoenixConnector
> > > KafkaStreamConsumer.jar maprdemo:5181 0 topic
> jdbc:phoenix:maprdemo:5181
> > > true
> > >
> > > The spark-defaults.conf is reverted back to its defaults (i.e. no
> > > userClassPathFirst). In the catch-block of the Phoenix connection
> buildup
> > > the
> > > class path is printed by recursively iterating over the class loaders.
> The
> > > first one already prints the phoenix-client jar [1]. It's also very
> > > unlikely to
> > > be a bug in Spark or Phoenix, if your proof-of-concept just works.
> > >
> > > So if the JAR that contains the offending class is known by the class
> > > loader,
> > > then that might indicate that there's a second JAR providing the same
> > > class
> > > but with a different version, right?
> > > Yet, the only Phoenix JAR on the whole class path hierarchy is the
> > > aforementioned phoenix-client JAR. Furthermore, I googled the class in
> > > question, ClientRpcControllerFactory, and it really only exists in the
> > > Phoenix
> > > project. We're not talking about some low-level AOP Alliance stuff
> here ;)
> > >
> > > Maybe I'm missing some fundamental class loading knowledge, in that
> case
> > > I'd
> > > be very happy to be enlightened. This all seems very strange.
> > >
> > > Cheers,
> > > Jeroen
> > >
> > > [1]
> > >
> [file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./spark-
> > > streaming-kafka_2.10-1.3.1.jar,
> > >
> > >
> file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./kafka_2.1
> > > 0-0.8.1.1.jar,
> > >
> > >
> file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./zkclient-
> > > 0.3.jar,
> > >
> > >
> file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./phoenix-4
> > > .4.0- HBase-0.98-client.jar,
> > > file:/opt/mapr/spark/spark-1.3.1/tmp/app-20

Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-10 Thread Jeroen Vlek
Hi Josh,

That worked! Thank you so much! (I can't believe it was something so obvious 
;) )

If you care about such a thing you could answer my question here for bounty: 
http://stackoverflow.com/questions/30639659/apache-phoenix-4-3-1-and-4-4-0-hbase-0-98-on-spark-1-3-1-classnotfoundexceptio

Have a great day!

Cheers,
Jeroen

On Wednesday 10 June 2015 08:58:02 Josh Mahonin wrote:
> Hi Jeroen,
> 
> Rather than bundle the Phoenix client JAR with your app, are you able to
> include it in a static location either in the SPARK_CLASSPATH, or set the
> conf values below (I use SPARK_CLASSPATH myself, though it's deprecated):
> 
>   spark.driver.extraClassPath
>   spark.executor.extraClassPath
> 
> Josh
> 
> On Wed, Jun 10, 2015 at 4:11 AM, Jeroen Vlek  wrote:
> > Hi Josh,
> > 
> > Thank you for your effort. Looking at your code, I feel that mine is
> > semantically the same, except written in Java. The dependencies in the
> > pom.xml
> > all have the scope provided. The job is submitted as follows:
> > 
> > $ rm spark.log && MASTER=spark://maprdemo:7077
> > /opt/mapr/spark/spark-1.3.1/bin/spark-submit-jars
> > /home/mapr/projects/customer/lib/spark-streaming-
> > 
> > kafka_2.10-1.3.1.jar,/home/mapr/projects/customer/lib/kafka_2.10-0.8.1.1.j
> > ar,/home/mapr/projects/customer/lib/zkclient-0.3.jar,/home/mapr/projects/c
> > ustomer/lib/metrics-
> > core-3.1.0.jar,/home/mapr/projects/customer/lib/metrics-
> > 
> > core-2.2.0.jar,lib/spark-sql_2.10-1.3.1.jar,/opt/mapr/phoenix/phoenix-4.4.
> > 0- HBase-0.98-bin/phoenix-4.4.0-HBase-0.98-client.jar --class
> > nl.work.kafkastreamconsumer.phoenix.KafkaPhoenixConnector
> > KafkaStreamConsumer.jar maprdemo:5181 0 topic jdbc:phoenix:maprdemo:5181
> > true
> > 
> > The spark-defaults.conf is reverted back to its defaults (i.e. no
> > userClassPathFirst). In the catch-block of the Phoenix connection buildup
> > the
> > class path is printed by recursively iterating over the class loaders. The
> > first one already prints the phoenix-client jar [1]. It's also very
> > unlikely to
> > be a bug in Spark or Phoenix, if your proof-of-concept just works.
> > 
> > So if the JAR that contains the offending class is known by the class
> > loader,
> > then that might indicate that there's a second JAR providing the same
> > class
> > but with a different version, right?
> > Yet, the only Phoenix JAR on the whole class path hierarchy is the
> > aforementioned phoenix-client JAR. Furthermore, I googled the class in
> > question, ClientRpcControllerFactory, and it really only exists in the
> > Phoenix
> > project. We're not talking about some low-level AOP Alliance stuff here ;)
> > 
> > Maybe I'm missing some fundamental class loading knowledge, in that case
> > I'd
> > be very happy to be enlightened. This all seems very strange.
> > 
> > Cheers,
> > Jeroen
> > 
> > [1]
> > [file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./spark-
> > streaming-kafka_2.10-1.3.1.jar,
> > 
> > file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./kafka_2.1
> > 0-0.8.1.1.jar,
> > 
> > file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./zkclient-
> > 0.3.jar,
> > 
> > file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./phoenix-4
> > .4.0- HBase-0.98-client.jar,
> > file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./spark-
> > sql_2.10-1.3.1.jar,
> > file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./metrics-
> > core-3.1.0.jar,
> > 
> > file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./KafkaStre
> > amConsumer.jar,
> > file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./metrics-
> > core-2.2.0.jar]
> > 
> > On Tuesday, June 09, 2015 11:18:08 AM Josh Mahonin wrote:
> > > This may or may not be helpful for your classpath issues, but I wanted
> > > to
> > > verify that basic functionality worked, so I made a sample app here:
> > > 
> > > https://github.com/jmahonin/spark-streaming-phoenix
> > > 
> > > This consumes events off a Kafka topic using spark streaming, and writes
> > > out event counts to Phoenix using the new phoenix-spark functionality:
> > > http://phoenix.apache.org/phoenix_spark.html
> > > 
> > > It's definitely overkill, and would probably be more efficient to use
> > > the
> > > JDBC driver directly, but it serves as a proof-of-concept.
> > > 
> > >

Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-10 Thread Josh Mahonin
Hi Jeroen,

Rather than bundle the Phoenix client JAR with your app, are you able to
include it in a static location either in the SPARK_CLASSPATH, or set the
conf values below (I use SPARK_CLASSPATH myself, though it's deprecated):

  spark.driver.extraClassPath
  spark.executor.extraClassPath
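
In spark-defaults.conf that might look like the following (the jar path is the one
from the spark-submit command quoted below in this thread):

  spark.driver.extraClassPath    /opt/mapr/phoenix/phoenix-4.4.0-HBase-0.98-bin/phoenix-4.4.0-HBase-0.98-client.jar
  spark.executor.extraClassPath  /opt/mapr/phoenix/phoenix-4.4.0-HBase-0.98-bin/phoenix-4.4.0-HBase-0.98-client.jar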

Josh

On Wed, Jun 10, 2015 at 4:11 AM, Jeroen Vlek  wrote:

> Hi Josh,
>
> Thank you for your effort. Looking at your code, I feel that mine is
> semantically the same, except written in Java. The dependencies in the
> pom.xml
> all have the scope provided. The job is submitted as follows:
>
> $ rm spark.log && MASTER=spark://maprdemo:7077
> /opt/mapr/spark/spark-1.3.1/bin/spark-submit-jars
> /home/mapr/projects/customer/lib/spark-streaming-
>
> kafka_2.10-1.3.1.jar,/home/mapr/projects/customer/lib/kafka_2.10-0.8.1.1.jar,/home/mapr/projects/customer/lib/zkclient-0.3.jar,/home/mapr/projects/customer/lib/metrics-
> core-3.1.0.jar,/home/mapr/projects/customer/lib/metrics-
>
> core-2.2.0.jar,lib/spark-sql_2.10-1.3.1.jar,/opt/mapr/phoenix/phoenix-4.4.0-
> HBase-0.98-bin/phoenix-4.4.0-HBase-0.98-client.jar --class
> nl.work.kafkastreamconsumer.phoenix.KafkaPhoenixConnector
> KafkaStreamConsumer.jar maprdemo:5181 0 topic jdbc:phoenix:maprdemo:5181
> true
>
> The spark-defaults.conf is reverted back to its defaults (i.e. no
> userClassPathFirst). In the catch-block of the Phoenix connection buildup
> the
> class path is printed by recursively iterating over the class loaders. The
> first one already prints the phoenix-client jar [1]. It's also very
> unlikely to
> be a bug in Spark or Phoenix, if your proof-of-concept just works.
>
> So if the JAR that contains the offending class is known by the class
> loader,
> then that might indicate that there's a second JAR providing the same class
> but with a different version, right?
> Yet, the only Phoenix JAR on the whole class path hierarchy is the
> aforementioned phoenix-client JAR. Furthermore, I googled the class in
> question, ClientRpcControllerFactory, and it really only exists in the
> Phoenix
> project. We're not talking about some low-level AOP Alliance stuff here ;)
>
> Maybe I'm missing some fundamental class loading knowledge, in that case
> I'd
> be very happy to be enlightened. This all seems very strange.
>
> Cheers,
> Jeroen
>
> [1]
> [file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./spark-
> streaming-kafka_2.10-1.3.1.jar,
>
> file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./kafka_2.10-0.8.1.1.jar,
>
> file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./zkclient-0.3.jar,
>
> file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./phoenix-4.4.0-
> HBase-0.98-client.jar,
> file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./spark-
> sql_2.10-1.3.1.jar,
> file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./metrics-
> core-3.1.0.jar,
>
> file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./KafkaStreamConsumer.jar,
> file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./metrics-
> core-2.2.0.jar]
>
>
> On Tuesday, June 09, 2015 11:18:08 AM Josh Mahonin wrote:
> > This may or may not be helpful for your classpath issues, but I wanted to
> > verify that basic functionality worked, so I made a sample app here:
> >
> > https://github.com/jmahonin/spark-streaming-phoenix
> >
> > This consumes events off a Kafka topic using spark streaming, and writes
> > out event counts to Phoenix using the new phoenix-spark functionality:
> > http://phoenix.apache.org/phoenix_spark.html
> >
> > It's definitely overkill, and would probably be more efficient to use the
> > JDBC driver directly, but it serves as a proof-of-concept.
> >
> > I've only tested this in local mode. To convert it to a full jobs JAR, I
> > suspect that keeping all of the spark and phoenix dependencies marked as
> > 'provided', and including the Phoenix client JAR in the Spark classpath
> > would work as well.
> >
> > Good luck,
> >
> > Josh
> >
> > On Tue, Jun 9, 2015 at 4:40 AM, Jeroen Vlek  wrote:
> > > Hi,
> > >
> > > I posted a question with regards to Phoenix and Spark Streaming on
> > > StackOverflow [1]. Please find a copy of the question to this email
> below
> > > the
> > > first stack trace. I also already contacted the Phoenix mailing list
> and
> > > tried
> > > the suggestion of setting spark.driver.userClassPathFirst.
> Unfortunately
> > > that
> > > only pushed me further into the dependency hell, which I trie

[Spark 1.3.1 on YARN on EMR] Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-06-10 Thread Roberto Coluccio
Hi!

I'm struggling with an issue with Spark 1.3.1 running on YARN, running on
an AWS EMR cluster. Such cluster is based on AMI 3.7.0 (hence Amazon Linux
2015.03, Hive 0.13 already installed and configured on the cluster, Hadoop
2.4, etc...). I make use of the AWS emr-bootstrap-action "install-spark" (
https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark) with
the option/version "-v1.3.1e" so as to get the latest Spark for EMR
installed and available.

I also have a simple Spark Streaming driver in my project. Such driver is
part of a larger Maven project: in the pom.xml I'm currently using

[...]


<properties>
  <scala.binary.version>2.10</scala.binary.version>
  <scala.version>2.10.4</scala.version>
  <java.version>1.7</java.version>
  <spark.version>1.3.1</spark.version>
  <hadoop.version>2.4.1</hadoop.version>
</properties>

[...]

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
    <exclusions>
      <exclusion>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
      </exclusion>
    </exclusions>
  </dependency>

  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
    <scope>provided</scope>
  </dependency>

  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
</dependencies>





In fact, at compile and build time everything works just fine if, in my
driver, I have:


-


val sparkConf = new SparkConf()
  .setAppName(appName)
  .set("spark.local.dir", "/tmp/" + appName)
  .set("spark.streaming.unpersist", "true")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[java.net.URI], classOf[String]))

val sc = new SparkContext(sparkConf)

val ssc = new StreamingContext(sc, config.batchDuration)
import org.apache.spark.streaming.StreamingContext._

ssc.checkpoint(sparkConf.get("spark.local.dir") + checkpointRelativeDir)


< some input reading actions >


< some input transformation actions >


val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

import sqlContext.implicits._

sqlContext.sql()


ssc.start()

ssc.awaitTerminationOrTimeout(config.timeout)



---


What happens is that, right after have been launched, the driver fails with
the exception:


15/06/10 11:38:18 ERROR yarn.ApplicationMaster: User class threw
exception: java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
java.lang.RuntimeException: java.lang.RuntimeException: Unable to
instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
at 
org.apache.spark.sql.hive.HiveContext.sessionState$lzycompute(HiveContext.scala:239)
at org.apache.spark.sql.hive.HiveContext.sessionState(HiveContext.scala:235)
at 
org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:251)
at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:250)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:95)
at  myDriver.scala: < line of the sqlContext.sql(query) >
Caused by < some stuff >
Caused by: javax.jdo.JDOFatalUserException: Class
org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
NestedThrowables:
java.lang.ClassNotFoundException:
org.datanucleus.api.jdo.JDOPersistenceManagerFactory
...
Caused by: java.lang.ClassNotFoundException:
org.datanucleus.api.jdo.JDOPersistenceManagerFactory


Thinking about a wrong Hive installation/configuration or libs/classpath
definition, I SSHed into the cluster and launched a spark-shell.
Excluding the app configuration and StreamingContext usage/definition, I
then carried out all the actions listed in the driver implementation, in
particular all the Hive-related ones and they all went through smoothly!


I also tried to use the optional "-h" argument (
https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/README.md#arguments-optional)
in the install-spark emr-bootstrap-action, but the driver failed the very
same way. Furthermore, when launching a spark-shell (on the EMR cluster
with Spark installed with the "-h" option), I also got:


15/06/09 14:20:51 WARN conf.HiveConf: hive-default.xml not found on CLASSPATH
15/06/09 14:20:52 INFO metastore.HiveMetaStore: 0: Opening raw store
with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/06/09 14:20:52 INFO metastore.ObjectStore: ObjectStore, initialize called
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle)
"org.datanucleus" is already registered. Ensure you dont have multiple
JAR versions of the same plugin in the classpath. The URL
"file:/home/hadoop/spark/classpath/hive/datanucleus-core-3.2.10.jar"
is already registered, and you are trying to register an identical
plugin located at URL
"file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-core-3.2.10.jar."
15/

Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-10 Thread Jeroen Vlek
Hi Josh,

Thank you for your effort. Looking at your code, I feel that mine is 
semantically the same, except written in Java. The dependencies in the pom.xml 
all have the scope provided. The job is submitted as follows:

$ rm spark.log && MASTER=spark://maprdemo:7077 
/opt/mapr/spark/spark-1.3.1/bin/spark-submit-jars 
/home/mapr/projects/customer/lib/spark-streaming-
kafka_2.10-1.3.1.jar,/home/mapr/projects/customer/lib/kafka_2.10-0.8.1.1.jar,/home/mapr/projects/customer/lib/zkclient-0.3.jar,/home/mapr/projects/customer/lib/metrics-
core-3.1.0.jar,/home/mapr/projects/customer/lib/metrics-
core-2.2.0.jar,lib/spark-sql_2.10-1.3.1.jar,/opt/mapr/phoenix/phoenix-4.4.0-
HBase-0.98-bin/phoenix-4.4.0-HBase-0.98-client.jar --class 
nl.work.kafkastreamconsumer.phoenix.KafkaPhoenixConnector 
KafkaStreamConsumer.jar maprdemo:5181 0 topic jdbc:phoenix:maprdemo:5181 true

The spark-defaults.conf is reverted back to its defaults (i.e. no 
userClassPathFirst). In the catch-block of the Phoenix connection buildup  the 
class path is printed by recursively iterating over the class loaders. The 
first one already prints the phoenix-client jar [1]. It's also very unlikely to 
be a bug in Spark or Phoenix, if your proof-of-concept just works.
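
(The classloader walk referred to above looks roughly like the following Scala sketch
of the same idea; the original code in the application is Java.)

    def dumpClasspath(cl: ClassLoader): Unit = cl match {
      case null => ()  // reached the bootstrap classloader
      case url: java.net.URLClassLoader =>
        url.getURLs.foreach(println)   // print every jar/dir this loader knows
        dumpClasspath(url.getParent)
      case other =>
        println(s"(non-URL classloader: ${other.getClass.getName})")
        dumpClasspath(other.getParent)
    }

    dumpClasspath(getClass.getClassLoader)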

So if the JAR that contains the offending class is known by the class loader, 
then that might indicate that there's a second JAR providing the same class 
but with a different version, right? 
Yet, the only Phoenix JAR on the whole class path hierarchy is the 
aforementioned phoenix-client JAR. Furthermore, I googled the class in 
question, ClientRpcControllerFactory, and it really only exists in the Phoenix 
project. We're not talking about some low-level AOP Alliance stuff here ;)

Maybe I'm missing some fundamental class loading knowledge, in that case I'd 
be very happy to be enlightened. This all seems very strange.

Cheers,
Jeroen

[1]  [file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./spark-
streaming-kafka_2.10-1.3.1.jar, 
file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./kafka_2.10-0.8.1.1.jar,
 
file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./zkclient-0.3.jar,
 
file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./phoenix-4.4.0-
HBase-0.98-client.jar, 
file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./spark-
sql_2.10-1.3.1.jar, 
file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./metrics-
core-3.1.0.jar, 
file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./KafkaStreamConsumer.jar,
 
file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./metrics-
core-2.2.0.jar]


On Tuesday, June 09, 2015 11:18:08 AM Josh Mahonin wrote:
> This may or may not be helpful for your classpath issues, but I wanted to
> verify that basic functionality worked, so I made a sample app here:
> 
> https://github.com/jmahonin/spark-streaming-phoenix
> 
> This consumes events off a Kafka topic using spark streaming, and writes
> out event counts to Phoenix using the new phoenix-spark functionality:
> http://phoenix.apache.org/phoenix_spark.html
> 
> It's definitely overkill, and would probably be more efficient to use the
> JDBC driver directly, but it serves as a proof-of-concept.
> 
> I've only tested this in local mode. To convert it to a full jobs JAR, I
> suspect that keeping all of the spark and phoenix dependencies marked as
> 'provided', and including the Phoenix client JAR in the Spark classpath
> would work as well.
> 
> Good luck,
> 
> Josh
> 
> On Tue, Jun 9, 2015 at 4:40 AM, Jeroen Vlek  wrote:
> > Hi,
> > 
> > I posted a question with regards to Phoenix and Spark Streaming on
> > StackOverflow [1]. Please find a copy of the question to this email below
> > the
> > first stack trace. I also already contacted the Phoenix mailing list and
> > tried
> > the suggestion of setting spark.driver.userClassPathFirst. Unfortunately
> > that
> > only pushed me further into the dependency hell, which I tried to resolve
> > until I hit a wall with an UnsatisfiedLinkError on Snappy.
> > 
> > What I am trying to achieve: To save a stream from Kafka into
> > Phoenix/Hbase
> > via Spark Streaming. I'm using MapR as a platform and the original
> > exception
> > happens both on a 3-node cluster, as on the MapR Sandbox (a VM for
> > experimentation), in YARN and stand-alone mode. Further experimentation
> > (like
> > the saveAsNewHadoopApiFile below), was done only on the sandbox in
> > standalone
> > mode.
> > 
> > Phoenix only supports Spark from 4.4.0 onwards, but I thought I could
> > use a naive implementation that creates a new connection for
> > every RDD from the DStream in 4.3.1.  This resulted in the
> > Class

Re: Spark 1.3.1 SparkSQL metastore exceptions

2015-06-09 Thread Cheng Lian
Seems that you're using a DB2 Hive metastore? I'm not sure whether Hive 
0.12.0 officially supports DB2, but probably not? (Since I didn't find 
DB2 scripts under the metastore/scripts/upgrade folder in Hive source tree.)


Cheng

On 6/9/15 8:28 PM, Needham, Guy wrote:

Hi,
I’m using Spark 1.3.1 to insert into a Hive 0.12 table from a SparkSQL 
query. The query is a very simple select from a dummy Hive table used 
for benchmarking.
I’m using a create table as statement to do the insert. No matter if I 
do that or an insert overwrite, I get the same Hive exception, unable 
to alter table, with some Hive metastore issues.
The data is inserted into the Hive table as expected, however I get a 
very long stacktrace. Does anyone know the meaning of the stacktrace 
and how I can avoid generating it every time I insert into a table?
scala> hiveContext.sql("create table 
benchmarking.spark_logins_benchmark as select * from 
benchmarking.logins_benchmark limit 10")

org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table.
at 
org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:387)
at 
org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1448)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:235)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:123)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.execute(InsertIntoHiveTable.scala:255)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1099)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1099)
at 
org.apache.spark.sql.hive.execution.CreateTableAsSelect.run(CreateTableAsSelect.scala:70)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:54)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:54)
at 
org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:64)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1099)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1099)

at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:147)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at 
org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:101)

at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
at $iwC$$iwC$$iwC.<init>(<console>:37)
at $iwC$$iwC.<init>(<console>:39)
at $iwC.<init>(<console>:41)
at <init>(<console>:43)
at .<init>(<console>:47)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at 
org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at 
org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)

at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
at 
org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
at 
org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)

at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
at org.apache.spark.repl.Main$.ma

Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-09 Thread Josh Mahonin
ting the
> following exception when opening a connection via the JDBC driver (cut
> for brevity, full stacktrace below):
>
> Caused by: java.lang.ClassNotFoundException:
> org.apache.hadoop.hbase.ipc.controller.ClientRpcControllerFactory
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
>
> The class in question is provided by the jar called phoenix-
> core-4.3.1.jar (despite it being in the HBase package namespace, I
> guess they need it to integrate with HBase).
>
> There are numerous questions on SO about ClassNotFoundExceptions
> on Spark and I've tried the fat-jar approach (both with Maven's
> assembly and shade plugins; I've inspected the jars, they **do**
> contain ClientRpcControllerFactory), and I've tried a lean jar while
> specifying the jars on the command line. For the latter, the command
> I used is as follows:
>
> /opt/mapr/spark/spark-1.3.1/bin/spark-submit --jars lib/spark-
> streaming-
>
> kafka_10-1.3.1.jar,lib/kafka_2.10-0.8.1.1.jar,lib/zkclient-0.3.jar,lib/metrics-
> core-3.1.0.jar,lib/metrics-core-2.2.0.jar,lib/phoenix-core-4.3.1.jar --
> class nl.work.kafkastreamconsumer.phoenix.KafkaPhoenixConnector
> KafkaStreamConsumer.jar node1:5181 0 topic
> jdbc:phoenix:node1:5181 true
>
> I've also done a classpath dump from within the code and the first
> classloader in the hierarchy already knows the Phoenix jar:
>
> 2015-06-04 10:52:34,323 [Executor task launch worker-1] INFO
> nl.work.kafkastreamconsumer.phoenix.LinePersister -
> [file:/home/work/projects/customer/KafkaStreamConsumer.jar,
> file:/home/work/projects/customer/lib/spark-streaming-
> kafka_2.10-1.3.1.jar,
> file:/home/work/projects/customer/lib/kafka_2.10-0.8.1.1.jar,
> file:/home/work/projects/customer/lib/zkclient-0.3.jar,
> file:/home/work/projects/customer/lib/metrics-core-3.1.0.jar,
> file:/home/work/projects/customer/lib/metrics-core-2.2.0.jar,
> file:/home/work/projects/customer/lib/phoenix-core-4.3.1.jar]
>
> So the question is: What am I missing here? Why can't Spark load the
> correct class? There should be only one version of the class flying
> around (namely the one from phoenix-core), so I doubt it's a
> versioning conflict.
>
> [Executor task launch worker-3] ERROR
> nl.work.kafkastreamconsumer.phoenix.LinePersister - Error while
> processing line
> java.lang.RuntimeException: java.sql.SQLException: ERROR 103
> (08004): Unable to establish connection.
> at
>
> nl.work.kafkastreamconsumer.phoenix.PhoenixConnection.(PhoenixConnection.java:41)
> at
>
> nl.work.kafkastreamconsumer.phoenix.LinePersister$1.call(LinePersister.java:40)
> at
>
> nl.work.kafkastreamconsumer.phoenix.LinePersister$1.call(LinePersister.java:32)
> at
>
> org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:999)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.generic.Growable$class.
> $plus$plus$eq(Growable.scala:48)
> at scala.collection.mutable.ArrayBuffer.
> $plus$plus$eq(ArrayBuffer.scala:103)
> at scala.collection.mutable.ArrayBuffer.
> $plus$plus$eq(ArrayBuffer.scala:47)
> at
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at
> scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at
> scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at
> org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
> at
> org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
> at
>
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
> at
>
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>

Spark 1.3.1 SparkSQL metastore exceptions

2015-06-09 Thread Needham, Guy
Hi,

I'm using Spark 1.3.1 to insert into a Hive 0.12 table from a SparkSQL query. 
The query is a very simple select from a dummy Hive table used for benchmarking.
I'm using a CREATE TABLE AS statement to do the insert. Whether I do that 
or an INSERT OVERWRITE, I get the same Hive exception, unable to alter table, 
with some Hive metastore issues.

The data is inserted into the Hive table as expected; however, I get a very long 
stacktrace. Does anyone know the meaning of the stacktrace and how I can avoid 
generating it every time I insert into a table?

scala> hiveContext.sql("create table benchmarking.spark_logins_benchmark as 
select * from benchmarking.logins_benchmark limit 10")
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table.
at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:387)
at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1448)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:235)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:123)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.execute(InsertIntoHiveTable.scala:255)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1099)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1099)
at 
org.apache.spark.sql.hive.execution.CreateTableAsSelect.run(CreateTableAsSelect.scala:70)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:54)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:54)
at 
org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:64)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1099)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1099)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:147)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:101)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:29)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:33)
at $iwC$$iwC$$iwC$$iwC.(:35)
at $iwC$$iwC$$iwC.(:37)
at $iwC$$iwC.(:39)
at $iwC.(:41)
at (:43)
at .(:47)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMe

Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-09 Thread Jeroen Vlek
va:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at 
java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at 
org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:455)
... 26 more
Caused by: java.lang.UnsupportedOperationException: Unable to find 
org.apache.hadoop.hbase.ipc.controller.ClientRpcControllerFactory
at 
org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:36)
at 
org.apache.hadoop.hbase.ipc.RpcControllerFactory.instantiate(RpcControllerFactory.java:56)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.(HConnectionManager.java:769)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.(HConnectionManager.java:689)
... 31 more
Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.hbase.ipc.controller.ClientRpcControllerFactory
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at 
sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:191)
at 
org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:32)
... 34 more

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at 
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)


== Below is my question from StackOverflow ==

I'm trying to connect to Phoenix via Spark and I keep getting the 
following exception when opening a connection via the JDBC driver (cut 
for brevity, full stacktrace below):

Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.hbase.ipc.controller.ClientRpcControllerFactory
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)

The class in question is provided by the jar called phoenix-
core-4.3.1.jar (despite it being in the HBase package namespace, I 
guess they need it to integrate with HBase).

There are numerous questions on SO about ClassNotFoundExceptions 
on Spark and I've tried the fat-jar approach (both with Maven's 
assembly and shade plugins; I've inspected the jars, they **do** 
contain ClientRpcControllerFactory), and I've tried a lean jar while 
specifying the jars on the command line. For the latter, the command 
I used is as follows:

    /opt/mapr/spark/spark-1.3.1/bin/spark-submit --jars lib/spark-
streaming-
kafka_10-1.3.1.jar,lib/kafka_2.10-0.8.1.1.jar,lib/zkclient-0.3.jar,lib/metrics-
core-3.1.0.jar,lib/metrics-core-2.2.0.jar,lib/phoenix-core-4.3.1.jar --
class nl.work.kafkastreamconsumer.phoenix.KafkaPhoenixConnector 
KafkaStreamConsumer.jar node1:5181 0 topic 
jdbc:phoenix:node1:5181 true
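
A workaround often suggested for this kind of executor-side ClassNotFoundException 
is to put the Phoenix jar on the executor and driver classpaths explicitly, instead 
of relying on --jars alone. A rough sketch only, assuming phoenix-core-4.3.1.jar is 
present at the same absolute path on every worker node (paths are illustrative):

    /opt/mapr/spark/spark-1.3.1/bin/spark-submit \
      --conf spark.executor.extraClassPath=/home/work/projects/customer/lib/phoenix-core-4.3.1.jar \
      --driver-class-path /home/work/projects/customer/lib/phoenix-core-4.3.1.jar \
      --jars lib/spark-streaming-kafka_2.10-1.3.1.jar,lib/kafka_2.10-0.8.1.1.jar,lib/zkclient-0.3.jar,lib/metrics-core-3.1.0.jar,lib/metrics-core-2.2.0.jar,lib/phoenix-core-4.3.1.jar \
      --class nl.work.kafkastreamconsumer.phoenix.KafkaPhoenixConnector \
      KafkaStreamConsumer.jar node1:5181 0 topic jdbc:phoenix:node1:5181 true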

I've also done a classpath dump from within the code and the first 
classloader in the hierarchy already knows the Phoenix jar:

2015-06-04 10:52:34,323 [Executor task launch worker-1] INFO  
nl.work.kafkastreamconsumer.phoenix.LinePersister - 
[file:/home/work/projects/customer/KafkaStreamConsumer.jar, 
file:/home/work/projects/customer/lib/spark-streaming-
kafka_2.10-1.3.1.jar, 
file:/home/work/projects/customer/lib/kafka_2.10-0.8.1.1.jar, 
file:/home/work

Re: Spark 1.3.1 On Mesos Issues.

2015-06-08 Thread John Omernik
It appears this may be related.

https://issues.apache.org/jira/browse/SPARK-1403

Granted, the NPE is in MapR's code, but having Spark (seemingly, I am not an
expert here, just basing it off the comments) switch its behavior (if
that's what it is doing) probably isn't good either. I guess the level that
this is happening at is way above my head.  :)



On Fri, Jun 5, 2015 at 4:38 PM, John Omernik  wrote:

> Thanks all. The answers post is me too, I multi-thread. That and Ted is
> aware too, and MapR is helping me with it.  I shall report the answer of that
> investigation when we have it.
>
> As to reproduction, I've installed the MapR file system, tried both versions
> 4.0.2 and 4.1.0.  I have Mesos running alongside MapR, and then I use
> standard methods for submitting Spark jobs to Mesos. I don't have my
> configs now, on vacation :) but I can share on Monday.
>
> I appreciate the support I am getting from everyone, the Mesos community,
> Spark community, and MapR.  Great to see folks solving problems and I will
> be sure to report back findings as they arise.
>
>
>
> On Friday, June 5, 2015, Tim Chen  wrote:
>
>> It seems like there is another thread going on:
>>
>>
>> http://answers.mapr.com/questions/163353/spark-from-apache-downloads-site-for-mapr.html
>>
>> I'm not particularly sure why, seems like the problem is that getting the
>> current context class loader is returning null in this instance.
>>
>> Do you have some repro steps or config we can try this?
>>
>> Tim
>>
>> On Fri, Jun 5, 2015 at 3:40 AM, Steve Loughran 
>> wrote:
>>
>>>
>>>  On 2 Jun 2015, at 00:14, Dean Wampler  wrote:
>>>
>>>  It would be nice to see the code for MapR FS Java API, but my google
>>> foo failed me (assuming it's open source)...
>>>
>>>
>>>  I know that MapRFS is closed source, don't know about the java JAR.
>>> Why not ask Ted Dunning (cc'd)  nicely to see if he can track down the
>>> stack trace for you.
>>>
>>>   So, shooting in the dark ;) there are a few things I would check, if
>>> you haven't already:
>>>
>>>  1. Could there be 1.2 versions of some Spark jars that get picked up
>>> at run time (but apparently not in local mode) on one or more nodes? (Side
>>> question: Does your node experiment fail on all nodes?) Put another way,
>>> are the classpaths good for all JVM tasks?
>>> 2. Can you use just MapR and Spark 1.3.1 successfully, bypassing Mesos?
>>>
>>>  Incidentally, how are you combining Mesos and MapR? Are you running
>>> Spark in Mesos, but accessing data in MapR-FS?
>>>
>>>  Perhaps the MapR "shim" library doesn't support Spark 1.3.1.
>>>
>>>  HTH,
>>>
>>>  dean
>>>
>>>  Dean Wampler, Ph.D.
>>> Author: Programming Scala, 2nd Edition
>>> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
>>> Typesafe <http://typesafe.com/>
>>> @deanwampler <http://twitter.com/deanwampler>
>>> http://polyglotprogramming.com
>>>
>>> On Mon, Jun 1, 2015 at 2:49 PM, John Omernik  wrote:
>>>
>>>> All -
>>>>
>>>>  I am facing an odd issue and I am not really sure where to go for
>>>> support at this point.  I am running MapR which complicates things as it
>>>> relates to Mesos, however this HAS worked in the past with no issues so I
>>>> am stumped here.
>>>>
>>>>  So for starters, here is what I am trying to run. This is a simple
>>>> show tables using the Hive Context:
>>>>
>>>>  from pyspark import SparkContext, SparkConf
>>>> from pyspark.sql import SQLContext, Row, HiveContext
>>>> sparkhc = HiveContext(sc)
>>>> test = sparkhc.sql("show tables")
>>>> for r in test.collect():
>>>>   print r
>>>>
>>>>  When I run it on 1.3.1 using ./bin/pyspark --master local  This works
>>>> with no issues.
>>>>
>>>>  When I run it using Mesos with all the settings configured (as they
>>>> had worked in the past) I get lost tasks and when I zoom in them, the error
>>>> that is being reported is below.  Basically it's a NullPointerException on
>>>> the com.mapr.fs.ShimLoader.  What's weird to me is that I took each instance
>>>> and compared both together, the class path, everything is exactly the same.
>>>> Yet running in local mode works, and runnin

Re: Caching parquet table (with GZIP) on Spark 1.3.1

2015-06-07 Thread Cheng Lian
Is it possible that some Parquet files in this data set have a different 
schema from the others? Especially the ones reported in the exception messages.


One way to confirm this is to use [parquet-tools] [1] to inspect these 
files:


$ parquet-schema <path-to-parquet-file>

Cheng

[1]: https://github.com/apache/parquet-mr/tree/master/parquet-tools
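
As a concrete illustration, a similar comparison can also be done from a pyspark 
shell; a minimal sketch, assuming an existing SparkContext sc and using the paths 
from the exception quoted below:

    from pyspark.sql import HiveContext

    hiveCtx = HiveContext(sc)
    # Schema inferred for the whole directory vs. the schema of one file
    # named in the error, to spot columns whose types differ (e.g. dcqv_val)
    df_all = hiveCtx.parquetFile("hdfs://f14ecat/tmp/tchart_0501_final")
    df_one = hiveCtx.parquetFile("hdfs://f14ecat/tmp/tchart_0501_final/part-r-1198.parquet")
    df_all.printSchema()
    df_one.printSchema()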

On 5/26/15 3:26 PM, shsh...@tsmc.com wrote:


we tried to cache a table through
hiveCtx = HiveContext(sc)
hiveCtx.cacheTable("table name")
as described in Spark 1.3.1's documentation. We're on CDH5.3.0 with Spark
1.3.1 built with Hadoop 2.6.
The following error message occurs if we try to cache a table stored in Parquet
format with GZIP,
though we're not sure whether this error message has anything to do with the
table format, since we can execute SQL on the exact same table.
We just hope to use cacheTable so that it might speed things up a little, since
we're querying this table several times.
Any advice is welcome! Thanks!

15/05/26 15:21:32 WARN scheduler.TaskSetManager: Lost task 227.0 in stage
0.0 (TID 278, f14ecats037): parquet.io.ParquetDecodingException: Can not
read value at 0 in block -1 in file
hdfs://f14ecat/tmp/tchart_0501_final/part-r-1198.parquet
 at parquet.hadoop.InternalParquetRecordReader.nextKeyValue
(InternalParquetRecordReader.java:213)
 at parquet.hadoop.ParquetRecordReader.nextKeyValue
(ParquetRecordReader.java:204)
 at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext
(NewHadoopRDD.scala:143)
 at org.apache.spark.InterruptibleIterator.hasNext
(InterruptibleIterator.scala:39)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon
$1.hasNext(InMemoryColumnarTableScan.scala:153)
 at org.apache.spark.storage.MemoryStore.unrollSafely
(MemoryStore.scala:248)
 at org.apache.spark.CacheManager.putInBlockManager
(CacheManager.scala:172)
 at org.apache.spark.CacheManager.getOrCompute
(CacheManager.scala:79)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
 at org.apache.spark.rdd.MapPartitionsRDD.compute
(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.rdd.MapPartitionsRDD.compute
(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.rdd.MapPartitionsRDD.compute
(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.rdd.MapPartitionsRDD.compute
(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.rdd.MapPartitionsRDD.compute
(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask
(ShuffleMapTask.scala:68)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask
(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at org.apache.spark.executor.Executor$TaskRunner.run
(Executor.scala:203)
 at java.util.concurrent.ThreadPoolExecutor.runWorker
(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run
(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
Caused by: parquet.io.ParquetDecodingException: The requested schema is not
compatible with the file schema. incompatible types: optional binary
dcqv_val (UTF8) != optional double dcqv_val
 at parquet.io.ColumnIOFactory
$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:105)
 at parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit
(ColumnIOFactory.java:97)
 at parquet.schema.PrimitiveType.accept(PrimitiveType.java:386)
 at parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren
(ColumnIOFactory.java:87)
 at parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit
(ColumnIOFactory.java:61)
 at parquet.schema.MessageType.accept(MessageType.java:55)
 at parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:148)
 at parquet.hadoop.InternalParquetRecordReader.checkRead
(InternalParquetRecordReader.java:125)
 at parquet.hadoop.InternalParquetRecordReader.nextKeyValue
(InternalParquetRecordReader.java:193)
 ... 31 more

15/05/26 15:21:32 INFO scheduler.TaskSetManager: Starting task 74.2 in
stage 0.0 (TID 377, f14ecats025, NODE_LOCAL, 2153 bytes)
15/05/26 15:21:32 INFO scheduler.TaskSetManager: Lo

Re: Which class takes place of BlockManagerWorker in Spark 1.3.1

2015-06-06 Thread Ted Yu
Hi,
Please take a look at:
[SPARK-3019] Pluggable block transfer interface (BlockTransferService)

- NioBlockTransferService implements BlockTransferService and replaces the
old BlockManagerWorker
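
For reference, a sketch of the related user-facing setting in 1.3.x; as far as I
can tell it selects between the two BlockTransferService implementations, with
netty being the default:

    # spark-defaults.conf
    spark.shuffle.blockTransferService  netty   # or "nio" for NioBlockTransferService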

Cheers

On Sat, Jun 6, 2015 at 2:23 AM, bit1...@163.com  wrote:

> Hi,
> I remember that there was a class called BlockManagerWorker in previous
> Spark releases. In the 1.3.1 code, I can see that some method
> comments still refer to BlockManagerWorker, which doesn't exist at all.
> Which class takes the place of BlockManagerWorker in Spark 1.3.1?
> Thanks.
>
> BTW, BlockManagerMaster is still there, so it seems odd that
> BlockManagerWorker is gone.
>
> --
> bit1...@163.com
>


Which class takes place of BlockManagerWorker in Spark 1.3.1

2015-06-06 Thread bit1...@163.com
Hi,
I remember that there was a class called BlockManagerWorker in previous Spark 
releases. In the 1.3.1 code, I can see that some method comments still refer 
to BlockManagerWorker, which doesn't exist at all.
Which class takes the place of BlockManagerWorker in Spark 1.3.1? 
Thanks. 

BTW, BlockManagerMaster is still there, so it seems odd that BlockManagerWorker is 
gone.



bit1...@163.com


Re: Spark 1.3.1 On Mesos Issues.

2015-06-05 Thread John Omernik
Thanks all. The answers post is me too, I multi-thread. That and Ted is
aware too, and MapR is helping me with it.  I shall report the answer of that
investigation when we have it.

As to reproduction, I've installed the MapR file system, tried both versions
4.0.2 and 4.1.0.  I have Mesos running alongside MapR, and then I use
standard methods for submitting Spark jobs to Mesos. I don't have my
configs now, on vacation :) but I can share on Monday.

I appreciate the support I am getting from everyone, the Mesos community,
Spark community, and MapR.  Great to see folks solving problems and I will
be sure to report back findings as they arise.



On Friday, June 5, 2015, Tim Chen  wrote:

> It seems like there is another thread going on:
>
>
> http://answers.mapr.com/questions/163353/spark-from-apache-downloads-site-for-mapr.html
>
> I'm not particularly sure why, seems like the problem is that getting the
> current context class loader is returning null in this instance.
>
> Do you have some repro steps or config we can try this?
>
> Tim
>
> On Fri, Jun 5, 2015 at 3:40 AM, Steve Loughran  > wrote:
>
>>
>>  On 2 Jun 2015, at 00:14, Dean Wampler > > wrote:
>>
>>  It would be nice to see the code for MapR FS Java API, but my google
>> foo failed me (assuming it's open source)...
>>
>>
>>  I know that MapRFS is closed source, don't know about the java JAR. Why
>> not ask Ted Dunning (cc'd)  nicely to see if he can track down the stack
>> trace for you.
>>
>>   So, shooting in the dark ;) there are a few things I would check, if
>> you haven't already:
>>
>>  1. Could there be 1.2 versions of some Spark jars that get picked up at
>> run time (but apparently not in local mode) on one or more nodes? (Side
>> question: Does your node experiment fail on all nodes?) Put another way,
>> are the classpaths good for all JVM tasks?
>> 2. Can you use just MapR and Spark 1.3.1 successfully, bypassing Mesos?
>>
>>  Incidentally, how are you combining Mesos and MapR? Are you running
>> Spark in Mesos, but accessing data in MapR-FS?
>>
>>  Perhaps the MapR "shim" library doesn't support Spark 1.3.1.
>>
>>  HTH,
>>
>>  dean
>>
>>  Dean Wampler, Ph.D.
>> Author: Programming Scala, 2nd Edition
>> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
>> Typesafe <http://typesafe.com/>
>> @deanwampler <http://twitter.com/deanwampler>
>> http://polyglotprogramming.com
>>
>> On Mon, Jun 1, 2015 at 2:49 PM, John Omernik > > wrote:
>>
>>> All -
>>>
>>>  I am facing an odd issue and I am not really sure where to go for
>>> support at this point.  I am running MapR which complicates things as it
>>> relates to Mesos, however this HAS worked in the past with no issues so I
>>> am stumped here.
>>>
>>>  So for starters, here is what I am trying to run. This is a simple
>>> show tables using the Hive Context:
>>>
>>>  from pyspark import SparkContext, SparkConf
>>> from pyspark.sql import SQLContext, Row, HiveContext
>>> sparkhc = HiveContext(sc)
>>> test = sparkhc.sql("show tables")
>>> for r in test.collect():
>>>   print r
>>>
>>>  When I run it on 1.3.1 using ./bin/pyspark --master local  This works
>>> with no issues.
>>>
>>>  When I run it using Mesos with all the settings configured (as they
>>> had worked in the past) I get lost tasks and when I zoom in them, the error
>>> that is being reported is below.  Basically it's a NullPointerException on
>>> the com.mapr.fs.ShimLoader.  What's weird to me is that I took each instance
>>> and compared both together, the class path, everything is exactly the same.
>>> Yet running in local mode works, and running in mesos fails.  Also of note,
>>> when the task is scheduled to run on the same node as when I run locally,
>>> that fails too! (Baffling).
>>>
>>>  Ok, for comparison, how I configured Mesos was to download the mapr4
>>> package from spark.apache.org.  Using the exact same configuration file
>>> (except for changing the executor tgz from 1.2.0 to 1.3.1) from the 1.2.0.
>>> When I run this example with the mapr4 for 1.2.0 there is no issue in
>>> Mesos, everything runs as intended. Using the same package for 1.3.1 then
>>> it fails.
>>>
>>>  (Also of note, 1.2.1 gives a 404 error, 1.2.2 fails, and 1.3.0 fails
>>> as well).
>>>
>>>  So basically When I us

Re: Spark 1.3.1 On Mesos Issues.

2015-06-05 Thread Tim Chen
It seems like there is another thread going on:

http://answers.mapr.com/questions/163353/spark-from-apache-downloads-site-for-mapr.html

I'm not particularly sure why; it seems like the problem is that getting the
current context class loader is returning null in this instance.

Do you have some repro steps or a config we can use to try this?

Tim

On Fri, Jun 5, 2015 at 3:40 AM, Steve Loughran 
wrote:

>
>  On 2 Jun 2015, at 00:14, Dean Wampler  wrote:
>
>  It would be nice to see the code for MapR FS Java API, but my google foo
> failed me (assuming it's open source)...
>
>
>  I know that MapRFS is closed source, don't know about the java JAR. Why
> not ask Ted Dunning (cc'd)  nicely to see if he can track down the stack
> trace for you.
>
>   So, shooting in the dark ;) there are a few things I would check, if
> you haven't already:
>
>  1. Could there be 1.2 versions of some Spark jars that get picked up at
> run time (but apparently not in local mode) on one or more nodes? (Side
> question: Does your node experiment fail on all nodes?) Put another way,
> are the classpaths good for all JVM tasks?
> 2. Can you use just MapR and Spark 1.3.1 successfully, bypassing Mesos?
>
>  Incidentally, how are you combining Mesos and MapR? Are you running
> Spark in Mesos, but accessing data in MapR-FS?
>
>  Perhaps the MapR "shim" library doesn't support Spark 1.3.1.
>
>  HTH,
>
>  dean
>
>  Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com/>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Mon, Jun 1, 2015 at 2:49 PM, John Omernik  wrote:
>
>> All -
>>
>> I am facing an odd issue and I am not really sure where to go for
>> support at this point.  I am running MapR which complicates things as it
>> relates to Mesos, however this HAS worked in the past with no issues so I
>> am stumped here.
>>
>>  So for starters, here is what I am trying to run. This is a simple show
>> tables using the Hive Context:
>>
>>  from pyspark import SparkContext, SparkConf
>> from pyspark.sql import SQLContext, Row, HiveContext
>> sparkhc = HiveContext(sc)
>> test = sparkhc.sql("show tables")
>> for r in test.collect():
>>   print r
>>
>>  When I run it on 1.3.1 using ./bin/pyspark --master local  This works
>> with no issues.
>>
>>  When I run it using Mesos with all the settings configured (as they had
>> worked in the past) I get lost tasks and when I zoom in them, the error
>> that is being reported is below.  Basically it's a NullPointerException on
>> the com.mapr.fs.ShimLoader.  What's weird to me is that I took each instance
>> and compared both together, the class path, everything is exactly the same.
>> Yet running in local mode works, and running in mesos fails.  Also of note,
>> when the task is scheduled to run on the same node as when I run locally,
>> that fails too! (Baffling).
>>
>>  Ok, for comparison, how I configured Mesos was to download the mapr4
>> package from spark.apache.org.  Using the exact same configuration file
>> (except for changing the executor tgz from 1.2.0 to 1.3.1) from the 1.2.0.
>> When I run this example with the mapr4 for 1.2.0 there is no issue in
>> Mesos, everything runs as intended. Using the same package for 1.3.1 then
>> it fails.
>>
>>  (Also of note, 1.2.1 gives a 404 error, 1.2.2 fails, and 1.3.0 fails as
>> well).
>>
>> So basically, when I used 1.2.0 and followed a set of steps, it worked
>> on Mesos and 1.3.1 fails.  Since this is a "current" version of Spark, MapR
>> is supports 1.2.1 only.  (Still working on that).
>>
>>  I guess I am at a loss right now on why this would be happening, any
>> pointers on where I could look or what I could tweak would be greatly
>> appreciated. Additionally, if there is something I could specifically draw
>> to the attention of MapR on this problem please let me know, I am perplexed
>> on the change from 1.2.0 to 1.3.1.
>>
>>  Thank you,
>>
>>  John
>>
>>
>>
>>
>>  Full Error on 1.3.1 on Mesos:
>> 15/05/19 09:31:26 INFO MemoryStore: MemoryStore started with capacity
>> 1060.3 MB java.lang.NullPointerException at
>> com.mapr.fs.ShimLoader.getRootClassLoader(ShimLoader.java:96) at
>> com.mapr.fs.ShimLoader.injectNativeLoader(ShimLoader.java:232) at
>> com.mapr.fs.ShimLoader.load(ShimLoader.java:194) at
>> org.apach

Re: Spark 1.3.1 On Mesos Issues.

2015-06-05 Thread Steve Loughran

On 2 Jun 2015, at 00:14, Dean Wampler 
mailto:deanwamp...@gmail.com>> wrote:

It would be nice to see the code for MapR FS Java API, but my google foo failed 
me (assuming it's open source)...


I know that MapRFS is closed source, don't know about the java JAR. Why not ask 
Ted Dunning (cc'd)  nicely to see if he can track down the stack trace for you.

So, shooting in the dark ;) there are a few things I would check, if you 
haven't already:

1. Could there be 1.2 versions of some Spark jars that get picked up at run 
time (but apparently not in local mode) on one or more nodes? (Side question: 
Does your node experiment fail on all nodes?) Put another way, are the 
classpaths good for all JVM tasks?
2. Can you use just MapR and Spark 1.3.1 successfully, bypassing Mesos?

Incidentally, how are you combining Mesos and MapR? Are you running Spark in 
Mesos, but accessing data in MapR-FS?

Perhaps the MapR "shim" library doesn't support Spark 1.3.1.

HTH,

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd 
Edition<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe<http://typesafe.com/>
@deanwampler<http://twitter.com/deanwampler>
http://polyglotprogramming.com<http://polyglotprogramming.com/>

On Mon, Jun 1, 2015 at 2:49 PM, John Omernik 
mailto:j...@omernik.com>> wrote:
All -

I am facing an odd issue and I am not really sure where to go for support at 
this point.  I am running MapR which complicates things as it relates to Mesos, 
however this HAS worked in the past with no issues so I am stumped here.

So for starters, here is what I am trying to run. This is a simple show tables 
using the Hive Context:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row, HiveContext
sparkhc = HiveContext(sc)
test = sparkhc.sql("show tables")
for r in test.collect():
  print r

When I run it on 1.3.1 using ./bin/pyspark --master local  This works with no 
issues.

When I run it using Mesos with all the settings configured (as they had worked 
in the past) I get lost tasks and when I zoom in them, the error that is being 
reported is below.  Basically it's a NullPointerException on the 
com.mapr.fs.ShimLoader.  What's weird to me is that I took each instance and 
compared both together, the class path, everything is exactly the same. Yet 
running in local mode works, and running in mesos fails.  Also of note, when 
the task is scheduled to run on the same node as when I run locally, that fails 
too! (Baffling).

Ok, for comparison, how I configured Mesos was to download the mapr4 package 
from spark.apache.org<http://spark.apache.org/>.  Using the exact same 
configuration file (except for changing the executor tgz from 1.2.0 to 1.3.1) 
from the 1.2.0.  When I run this example with the mapr4 for 1.2.0 there is no 
issue in Mesos, everything runs as intended. Using the same package for 1.3.1 
then it fails.
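
For context, the kind of configuration being swapped here is roughly the standard 
Mesos setup from the running-on-Mesos docs; a sketch with illustrative values only, 
since the actual configs are not shown in the thread:

    # spark-env.sh
    export MESOS_NATIVE_LIBRARY=/path/to/libmesos.so
    # spark-defaults.conf
    spark.master        mesos://zk://zk1:2181/mesos
    spark.executor.uri  hdfs:///path/to/spark-1.3.1-bin-mapr4.tgz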

(Also of note, 1.2.1 gives a 404 error, 1.2.2 fails, and 1.3.0 fails as well).

So basically, when I used 1.2.0 and followed a set of steps, it worked on Mesos 
and 1.3.1 fails.  Since this is a "current" version of Spark, MapR supports 
1.2.1 only.  (Still working on that).

I guess I am at a loss right now on why this would be happening, any pointers 
on where I could look or what I could tweak would be greatly appreciated. 
Additionally, if there is something I could specifically draw to the attention 
of MapR on this problem please let me know, I am perplexed on the change from 
1.2.0 to 1.3.1.

Thank you,

John




Full Error on 1.3.1 on Mesos:
15/05/19 09:31:26 INFO MemoryStore: MemoryStore started with capacity 1060.3 MB 
java.lang.NullPointerException at 
com.mapr.fs.ShimLoader.getRootClassLoader(ShimLoader.java:96) at 
com.mapr.fs.ShimLoader.injectNativeLoader(ShimLoader.java:232) at 
com.mapr.fs.ShimLoader.load(ShimLoader.java:194) at 
org.apache.hadoop.conf.CoreDefaultProperties.(CoreDefaultProperties.java:60) at 
java.lang.Class.forName0(Native Method) at 
java.lang.Class.forName(Class.java:274) at 
org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:1847)
 at org.apache.hadoop.conf.Configuration.getProperties(Configuration.java:2062) 
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2272) 
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2224) 
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2141) at 
org.apache.hadoop.conf.Configuration.set(Configuration.java:992) at 
org.apache.hadoop.conf.Configuration.set(Configuration.java:966) at 
org.apache.spark.deploy.SparkHadoopUtil.newConfiguration(SparkHadoopUtil.scala:98)
 at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:43) at 
org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:220) at 
org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala) a

Error running sbt package on Windows 7 for Spark 1.3.1 and SimpleApp.scala

2015-06-04 Thread Joseph Washington
Hi all,
I'm trying to run the standalone application SimpleApp.scala following the
instructions on the
http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala
I was able to create a .jar file by doing sbt package. However when I tried
to do

$ YOUR_SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4]
c:/myproject/target/scala-2.10/simple-project_2.10-1.0.jar

I didn't get the desired result. There is a lot of output, but a few areas
said ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
[image: Inline image 2]

Furthermore, trying sbt run and sbt compile from the myproject folder gives
this error:


[image: Inline image 1]

Any ideas?
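
For reference, the quick-start guide's standalone app assumes a simple.sbt next to 
the source along these lines; a sketch, since the build file isn't shown above, but 
the artifact name simple-project_2.10-1.0.jar matches settings like these:

    name := "Simple Project"

    version := "1.0"

    scalaVersion := "2.10.4"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1"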


Re: Spark 1.3.1 On Mesos Issues.

2015-06-04 Thread John Omernik
So, a few updates.  When I run locally as stated before, it works fine. When I
run in YARN (via Apache Myriad on Mesos) it also runs fine. The only issue
is specifically with Mesos. I wonder if there is some sort of classpath
goodness I need to fix or something along those lines.  Any tips would be
appreciated.

Thanks!

John

On Mon, Jun 1, 2015 at 6:14 PM, Dean Wampler  wrote:

> It would be nice to see the code for MapR FS Java API, but my google foo
> failed me (assuming it's open source)...
>
> So, shooting in the dark ;) there are a few things I would check, if you
> haven't already:
>
> 1. Could there be 1.2 versions of some Spark jars that get picked up at
> run time (but apparently not in local mode) on one or more nodes? (Side
> question: Does your node experiment fail on all nodes?) Put another way,
> are the classpaths good for all JVM tasks?
> 2. Can you use just MapR and Spark 1.3.1 successfully, bypassing Mesos?
>
> Incidentally, how are you combining Mesos and MapR? Are you running Spark
> in Mesos, but accessing data in MapR-FS?
>
> Perhaps the MapR "shim" library doesn't support Spark 1.3.1.
>
> HTH,
>
> dean
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Mon, Jun 1, 2015 at 2:49 PM, John Omernik  wrote:
>
>> All -
>>
>> I am facing an odd issue and I am not really sure where to go for
>> support at this point.  I am running MapR which complicates things as it
>> relates to Mesos, however this HAS worked in the past with no issues so I
>> am stumped here.
>>
>> So for starters, here is what I am trying to run. This is a simple show
>> tables using the Hive Context:
>>
>> from pyspark import SparkContext, SparkConf
>> from pyspark.sql import SQLContext, Row, HiveContext
>> sparkhc = HiveContext(sc)
>> test = sparkhc.sql("show tables")
>> for r in test.collect():
>>   print r
>>
>> When I run it on 1.3.1 using ./bin/pyspark --master local  This works
>> with no issues.
>>
>> When I run it using Mesos with all the settings configured (as they had
>> worked in the past) I get lost tasks and when I zoom in them, the error
>> that is being reported is below.  Basically it's a NullPointerException on
>> the com.mapr.fs.ShimLoader.  What's weird to me is that I took each instance
>> and compared both together, the class path, everything is exactly the same.
>> Yet running in local mode works, and running in mesos fails.  Also of note,
>> when the task is scheduled to run on the same node as when I run locally,
>> that fails too! (Baffling).
>>
>> Ok, for comparison, how I configured Mesos was to download the mapr4
>> package from spark.apache.org.  Using the exact same configuration file
>> (except for changing the executor tgz from 1.2.0 to 1.3.1) from the 1.2.0.
>> When I run this example with the mapr4 for 1.2.0 there is no issue in
>> Mesos, everything runs as intended. Using the same package for 1.3.1 then
>> it fails.
>>
>> (Also of note, 1.2.1 gives a 404 error, 1.2.2 fails, and 1.3.0 fails as
>> well).
>>
>> So basically, when I used 1.2.0 and followed a set of steps, it worked on
>> Mesos and 1.3.1 fails.  Since this is a "current" version of Spark, MapR
>> supports 1.2.1 only.  (Still working on that).
>>
>> I guess I am at a loss right now on why this would be happening, any
>> pointers on where I could look or what I could tweak would be greatly
>> appreciated. Additionally, if there is something I could specifically draw
>> to the attention of MapR on this problem please let me know, I am perplexed
>> on the change from 1.2.0 to 1.3.1.
>>
>> Thank you,
>>
>> John
>>
>>
>>
>>
>> Full Error on 1.3.1 on Mesos:
>> 15/05/19 09:31:26 INFO MemoryStore: MemoryStore started with capacity
>> 1060.3 MB java.lang.NullPointerException at
>> com.mapr.fs.ShimLoader.getRootClassLoader(ShimLoader.java:96) at
>> com.mapr.fs.ShimLoader.injectNativeLoader(ShimLoader.java:232) at
>> com.mapr.fs.ShimLoader.load(ShimLoader.java:194) at
>> org.apache.hadoop.conf.CoreDefaultProperties.(CoreDefaultProperties.java:60)
>> at java.lang.Class.forName0(Native Method) at
>> java.lang.Class.forName(Class.java:274) at
>> org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:1847)
>> at
>> org.apache.hadoop.conf

Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-06-02 Thread Sean Owen
We are having a separate discussion about this, but I don't understand why
this is a problem. You're supposed to build Spark for Hadoop 1 if you run
it on Hadoop 1, and I am not sure that is happening here, given the error. I
do not think this should change, as I do not see that it fixes anything.

Let's please concentrate the follow up on the JIRA since you already made
one.
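
For concreteness, "build Spark for Hadoop 1" means something along the lines of the
standard Maven build with the Hadoop version pinned to the cluster's version; a
sketch, assuming a stock Spark 1.3.1 source checkout:

    mvn -Dhadoop.version=1.0.4 -DskipTests clean package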

On Wed, Jun 3, 2015 at 2:26 AM, Shixiong Zhu  wrote:

> Ryan - I sent a PR to fix your issue:
> https://github.com/apache/spark/pull/6599
>
> Edward - I have no idea why the following error happened. "ContextCleaner"
> doesn't use any Hadoop API. Could you try Spark 1.3.0? It's supposed to
> support both hadoop 1 and hadoop 2.
>
> * "Exception in thread "Spark Context Cleaner"
> java.lang.NoClassDefFoundError: 0
> at
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:149)"
>
>
> Best Regards,
> Shixiong Zhu
>
> 2015-06-03 0:08 GMT+08:00 Ryan Williams :
>
>> I think this is causing issues upgrading ADAM
>> <https://github.com/bigdatagenomics/adam> to Spark 1.3.1 (cf. adam#690
>> <https://github.com/bigdatagenomics/adam/pull/690#issuecomment-107769383>);
>> attempting to build against Hadoop 1.0.4 yields errors like:
>>
>> 2015-06-02 15:57:44 ERROR Executor:96 - Exception in task 0.0 in stage
>> 0.0 (TID 0)
>> *java.lang.IncompatibleClassChangeError: Found class
>> org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected*
>> at
>> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
>> at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
>> at
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
>> at
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>> at org.apache.spark.scheduler.Task.run(Task.scala:64)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> 2015-06-02 15:57:44 WARN  TaskSetManager:71 - Lost task 0.0 in stage 0.0
>> (TID 0, localhost): java.lang.IncompatibleClassChangeError: Found class
>> org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
>> at
>> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
>> at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
>> at
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
>> at
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>> at org.apache.spark.scheduler.Task.run(Task.scala:64)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> TaskAttemptContext is a class in Hadoop 1.0.4, but an interface in Hadoop
>> 2; Spark 1.3.1 expects the interface but is getting the class.
>>
>> It sounds like, while I *can* depend on Spark 1.3.1 and Hadoop 1.0.4, I
>> then need to hope that I don't exercise certain Spark code paths that run
>> afoul of differences between Hadoop 1 and 2; does that seem correct?
>>
>> Thanks!
>>
>> On Wed, May 20, 2015 at 1:52 PM Sean Owen  wrote:
>>
>>> I don't think any of those problems are related to Hadoop. Have you
>>> looked at userClassPathFirst settings?
>>>
>>> On Wed, May 20, 2015 at 6:46 PM, Edward Sargisson 
>>> wrote:
>>>
>>>> Hi Sean and Ted,
>>>> Thanks for your replies.
>>>>
>>>> I don't have our current problems nicely written up as good questions
>>>> yet. I'm still sorting out classpath issues, etc.
>>>> In case it is of help, I'm seeing:
>>>> * "Exception in thread "Spark Context Cleaner"
>>>> java.lang.NoClassDefFoundError: 0
>>>> at
>>>> org.apache.spark.ContextCleaner$$a

Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-06-02 Thread Ryan Williams
Thanks so much Shixiong! This is great.

On Tue, Jun 2, 2015 at 8:26 PM Shixiong Zhu  wrote:

> Ryan - I sent a PR to fix your issue:
> https://github.com/apache/spark/pull/6599
>
> Edward - I have no idea why the following error happened. "ContextCleaner"
> doesn't use any Hadoop API. Could you try Spark 1.3.0? It's supposed to
> support both hadoop 1 and hadoop 2.
>
>
> * "Exception in thread "Spark Context Cleaner"
> java.lang.NoClassDefFoundError: 0
> at
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:149)"
>
>
> Best Regards,
> Shixiong Zhu
>
> 2015-06-03 0:08 GMT+08:00 Ryan Williams :
>
>> I think this is causing issues upgrading ADAM
>> <https://github.com/bigdatagenomics/adam> to Spark 1.3.1 (cf. adam#690
>> <https://github.com/bigdatagenomics/adam/pull/690#issuecomment-107769383>);
>> attempting to build against Hadoop 1.0.4 yields errors like:
>>
>> 2015-06-02 15:57:44 ERROR Executor:96 - Exception in task 0.0 in stage
>> 0.0 (TID 0)
>> *java.lang.IncompatibleClassChangeError: Found class
>> org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected*
>> at
>> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
>> at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
>> at
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
>> at
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>> at org.apache.spark.scheduler.Task.run(Task.scala:64)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> 2015-06-02 15:57:44 WARN  TaskSetManager:71 - Lost task 0.0 in stage 0.0
>> (TID 0, localhost): java.lang.IncompatibleClassChangeError: Found class
>> org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
>> at
>> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
>> at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
>> at
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
>> at
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>> at org.apache.spark.scheduler.Task.run(Task.scala:64)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> TaskAttemptContext is a class in Hadoop 1.0.4, but an interface in Hadoop
>> 2; Spark 1.3.1 expects the interface but is getting the class.
>>
>> It sounds like, while I *can* depend on Spark 1.3.1 and Hadoop 1.0.4, I
>> then need to hope that I don't exercise certain Spark code paths that run
>> afoul of differences between Hadoop 1 and 2; does that seem correct?
>>
>> Thanks!
>>
>> On Wed, May 20, 2015 at 1:52 PM Sean Owen  wrote:
>>
>>> I don't think any of those problems are related to Hadoop. Have you
>>> looked at userClassPathFirst settings?
>>>
>>> On Wed, May 20, 2015 at 6:46 PM, Edward Sargisson 
>>> wrote:
>>>
>>>> Hi Sean and Ted,
>>>> Thanks for your replies.
>>>>
>>>> I don't have our current problems nicely written up as good questions
>>>> yet. I'm still sorting out classpath issues, etc.
>>>> In case it is of help, I'm seeing:
>>>> * "Exception in thread "Spark Context Cleaner"
>>>> java.lang.NoClassDefFoundError: 0
>>>> at
>>>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:149)"
>>>> * We've been having clashing dependencies between a colleague and I
>>>> because of the aforementioned classpath issue
>>>> * The clashing dependencies are also causing issues with what jetty
>>>&

Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-06-02 Thread Shixiong Zhu
Ryan - I sent a PR to fix your issue:
https://github.com/apache/spark/pull/6599

Edward - I have no idea why the following error happened. "ContextCleaner"
doesn't use any Hadoop API. Could you try Spark 1.3.0? It's supposed to
support both hadoop 1 and hadoop 2.

* "Exception in thread "Spark Context Cleaner"
java.lang.NoClassDefFoundError: 0
at
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:149)"


Best Regards,
Shixiong Zhu

2015-06-03 0:08 GMT+08:00 Ryan Williams :

> I think this is causing issues upgrading ADAM
> <https://github.com/bigdatagenomics/adam> to Spark 1.3.1 (cf. adam#690
> <https://github.com/bigdatagenomics/adam/pull/690#issuecomment-107769383>);
> attempting to build against Hadoop 1.0.4 yields errors like:
>
> 2015-06-02 15:57:44 ERROR Executor:96 - Exception in task 0.0 in stage 0.0
> (TID 0)
> *java.lang.IncompatibleClassChangeError: Found class
> org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected*
> at
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
> at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 2015-06-02 15:57:44 WARN  TaskSetManager:71 - Lost task 0.0 in stage 0.0
> (TID 0, localhost): java.lang.IncompatibleClassChangeError: Found class
> org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
> at
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
> at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> TaskAttemptContext is a class in Hadoop 1.0.4, but an interface in Hadoop
> 2; Spark 1.3.1 expects the interface but is getting the class.
>
> It sounds like, while I *can* depend on Spark 1.3.1 and Hadoop 1.0.4, I
> then need to hope that I don't exercise certain Spark code paths that run
> afoul of differences between Hadoop 1 and 2; does that seem correct?
>
> Thanks!
>
> On Wed, May 20, 2015 at 1:52 PM Sean Owen  wrote:
>
>> I don't think any of those problems are related to Hadoop. Have you
>> looked at userClassPathFirst settings?
>>
>> On Wed, May 20, 2015 at 6:46 PM, Edward Sargisson 
>> wrote:
>>
>>> Hi Sean and Ted,
>>> Thanks for your replies.
>>>
>>> I don't have our current problems nicely written up as good questions
>>> yet. I'm still sorting out classpath issues, etc.
>>> In case it is of help, I'm seeing:
>>> * "Exception in thread "Spark Context Cleaner"
>>> java.lang.NoClassDefFoundError: 0
>>> at
>>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:149)"
>>> * We've been having clashing dependencies between a colleague and I
>>> because of the aforementioned classpath issue
>>> * The clashing dependencies are also causing issues with what jetty
>>> libraries are available in the classloader from Spark and don't clash with
>>> existing libraries we have.
>>>
>>> More anon,
>>>
>>> Cheers,
>>> Edward
>>>
>>>
>>>
>>>  Original Message 
>>>  Subject: Re: spark 1.3.1 jars in repo1.maven.org Date: 2015-05-20 00:38
>>> From: Sean Owen  To: Edward Sargisson <
>>> esa...@pobox.com>

Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-06-02 Thread Ryan Williams
I think this is causing issues upgrading ADAM
<https://github.com/bigdatagenomics/adam> to Spark 1.3.1 (cf. adam#690
<https://github.com/bigdatagenomics/adam/pull/690#issuecomment-107769383>);
attempting to build against Hadoop 1.0.4 yields errors like:

2015-06-02 15:57:44 ERROR Executor:96 - Exception in task 0.0 in stage 0.0
(TID 0)
*java.lang.IncompatibleClassChangeError: Found class
org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected*
at
org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2015-06-02 15:57:44 WARN  TaskSetManager:71 - Lost task 0.0 in stage 0.0
(TID 0, localhost): java.lang.IncompatibleClassChangeError: Found class
org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
at
org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

TaskAttemptContext is a class in Hadoop 1.0.4, but an interface in Hadoop
2; Spark 1.3.1 expects the interface but is getting the class.

It sounds like, while I *can* depend on Spark 1.3.1 and Hadoop 1.0.4, I
then need to hope that I don't exercise certain Spark code paths that run
afoul of differences between Hadoop 1 and 2; does that seem correct?

Thanks!

On Wed, May 20, 2015 at 1:52 PM Sean Owen  wrote:

> I don't think any of those problems are related to Hadoop. Have you looked
> at userClassPathFirst settings?
>
> On Wed, May 20, 2015 at 6:46 PM, Edward Sargisson 
> wrote:
>
>> Hi Sean and Ted,
>> Thanks for your replies.
>>
>> I don't have our current problems nicely written up as good questions
>> yet. I'm still sorting out classpath issues, etc.
>> In case it is of help, I'm seeing:
>> * "Exception in thread "Spark Context Cleaner"
>> java.lang.NoClassDefFoundError: 0
>> at
>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:149)"
>> * We've been having clashing dependencies between a colleague and I
>> because of the aforementioned classpath issue
>> * The clashing dependencies are also causing issues with what jetty
>> libraries are available in the classloader from Spark and don't clash with
>> existing libraries we have.
>>
>> More anon,
>>
>> Cheers,
>> Edward
>>
>>
>>
>>  Original Message 
>>  Subject: Re: spark 1.3.1 jars in repo1.maven.org Date: 2015-05-20 00:38
>> From: Sean Owen  To: Edward Sargisson <
>> esa...@pobox.com> Cc: user 
>>
>>
>> Yes, the published artifacts can only refer to one version of anything
>> (OK, modulo publishing a large number of variants under classifiers).
>>
>> You aren't intended to rely on Spark's transitive dependencies for
>> anything. Compiling against the Spark API has no relation to what
>> version of Hadoop it binds against because it's not part of any API.
>> You mark the Spark dependency even as "provided" in your build and get
>> all the Spark/Hadoop bindings at runtime from our cluster.
>>
>> What problem are you experiencing?
>>
>>
>> On Wed, May 20, 2015 at 2:17 AM, Edward Sargisson 
>> wrote:
>>
>> Hi,
>> I'd like to confirm an observation I've just made. Specifically that spark
>> is only available in repo1.maven.org for one Hadoop variant.
>>
>> The Spark source can be compiled against a nu

Re: Spark 1.3.1 bundle does not build - unresolved dependency

2015-06-02 Thread Akhil Das
You can try to skip the tests, try with:

mvn -Dhadoop.version=2.4.0 -Pyarn *-DskipTests* clean package


Thanks
Best Regards

On Tue, Jun 2, 2015 at 2:51 AM, Stephen Boesch  wrote:

> I downloaded the 1.3.1 distro tarball
>
> $ll ../spark-1.3.1.tar.gz
> -rw-r-@ 1 steve  staff  8500861 Apr 23 09:58 ../spark-1.3.1.tar.gz
>
> However the build on it is failing with an unresolved dependency: 
> *configuration
> not public*
>
> $ build/sbt   assembly -Dhadoop.version=2.5.2 -Pyarn -Phadoop-2.4
>
> [error] (network-shuffle/*:update) sbt.ResolveException: *unresolved
> dependency: *org.apache.spark#spark-network-common_2.10;1.3.1: *configuration
> not public* in org.apache.spark#spark-network-common_2.10;1.3.1: 'test'.
> It was required from org.apache.spark#spark-network-shuffle_2.10;1.3.1 test
>
> Is there a known workaround for this?
>
> thanks
>
>


Re: Spark 1.3.1 On Mesos Issues.

2015-06-01 Thread Dean Wampler
It would be nice to see the code for MapR FS Java API, but my google foo
failed me (assuming it's open source)...

So, shooting in the dark ;) there are a few things I would check, if you
haven't already:

1. Could there be 1.2 versions of some Spark jars that get picked up at run
time (but apparently not in local mode) on one or more nodes? (Side
question: Does your node experiment fail on all nodes?) Put another way,
are the classpaths good for all JVM tasks?
2. Can you use just MapR and Spark 1.3.1 successfully, bypassing Mesos?

Incidentally, how are you combining Mesos and MapR? Are you running Spark
in Mesos, but accessing data in MapR-FS?

Perhaps the MapR "shim" library doesn't support Spark 1.3.1.

HTH,

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Mon, Jun 1, 2015 at 2:49 PM, John Omernik  wrote:

> All -
>
> I am facing an odd issue and I am not really sure where to go for support
> at this point.  I am running MapR which complicates things as it relates to
> Mesos, however this HAS worked in the past with no issues so I am stumped
> here.
>
> So for starters, here is what I am trying to run. This is a simple show
> tables using the Hive Context:
>
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SQLContext, Row, HiveContext
> sparkhc = HiveContext(sc)
> test = sparkhc.sql("show tables")
> for r in test.collect():
>   print r
>
> When I run it on 1.3.1 using ./bin/pyspark --master local  This works with
> no issues.
>
> When I run it using Mesos with all the settings configured (as they had
> worked in the past) I get lost tasks, and when I zoom in on them, the error
> being reported is below.  Basically it's a NullPointerException in
> com.mapr.fs.ShimLoader.  What's weird to me is that I took each instance
> and compared the two: the classpath, everything, is exactly the same.
> Yet running in local mode works, and running in Mesos fails.  Also of note,
> when the task is scheduled to run on the same node as when I run locally,
> it fails too! (Baffling.)
>
> Ok, for comparison, how I configured Mesos was to download the mapr4
> package from spark.apache.org and use the exact same configuration file as
> for 1.2.0 (except for changing the executor tgz from 1.2.0 to 1.3.1).
> When I run this example with the mapr4 package for 1.2.0 there is no issue
> in Mesos; everything runs as intended. Using the same package for 1.3.1,
> it fails.
>
> (Also of note, 1.2.1 gives a 404 error, 1.2.2 fails, and 1.3.0 fails as
> well).
>
> So basically, when I used 1.2.0 and followed a set of steps, it worked on
> Mesos, and 1.3.1 fails.  And while this is a "current" version of Spark, MapR
> supports 1.2.1 only.  (Still working on that.)
>
> I guess I am at a loss right now on why this would be happening, any
> pointers on where I could look or what I could tweak would be greatly
> appreciated. Additionally, if there is something I could specifically draw
> to the attention of MapR on this problem please let me know, I am perplexed
> on the change from 1.2.0 to 1.3.1.
>
> Thank you,
>
> John
>
>
>
>
> Full Error on 1.3.1 on Mesos:
> 15/05/19 09:31:26 INFO MemoryStore: MemoryStore started with capacity
> 1060.3 MB java.lang.NullPointerException at
> com.mapr.fs.ShimLoader.getRootClassLoader(ShimLoader.java:96) at
> com.mapr.fs.ShimLoader.injectNativeLoader(ShimLoader.java:232) at
> com.mapr.fs.ShimLoader.load(ShimLoader.java:194) at
> org.apache.hadoop.conf.CoreDefaultProperties.(CoreDefaultProperties.java:60)
> at java.lang.Class.forName0(Native Method) at
> java.lang.Class.forName(Class.java:274) at
> org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:1847)
> at
> org.apache.hadoop.conf.Configuration.getProperties(Configuration.java:2062)
> at
> org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2272)
> at
> org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2224)
> at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2141)
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:992) at
> org.apache.hadoop.conf.Configuration.set(Configuration.java:966) at
> org.apache.spark.deploy.SparkHadoopUtil.newConfiguration(SparkHadoopUtil.scala:98)
> at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:43) at
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:220) at
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala) at
> org.apache.spark.

Spark 1.3.1 bundle does not build - unresolved dependency

2015-06-01 Thread Stephen Boesch
I downloaded the 1.3.1 distro tarball

$ll ../spark-1.3.1.tar.gz
-rw-r-@ 1 steve  staff  8500861 Apr 23 09:58 ../spark-1.3.1.tar.gz

However the build on it is failing with an unresolved dependency:
*configuration
not public*

$ build/sbt   assembly -Dhadoop.version=2.5.2 -Pyarn -Phadoop-2.4

[error] (network-shuffle/*:update) sbt.ResolveException: *unresolved
dependency: *org.apache.spark#spark-network-common_2.10;1.3.1: *configuration
not public* in org.apache.spark#spark-network-common_2.10;1.3.1: 'test'. It
was required from org.apache.spark#spark-network-shuffle_2.10;1.3.1 test

Is there a known workaround for this?

thanks


Spark 1.3.1 On Mesos Issues.

2015-06-01 Thread John Omernik
All -

I am facing an odd issue and I am not really sure where to go for support
at this point.  I am running MapR which complicates things as it relates to
Mesos, however this HAS worked in the past with no issues so I am stumped
here.

So for starters, here is what I am trying to run. This is a simple show
tables using the Hive Context:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row, HiveContext
sparkhc = HiveContext(sc)
test = sparkhc.sql("show tables")
for r in test.collect():
  print r

When I run it on 1.3.1 using ./bin/pyspark --master local  This works with
no issues.

When I run it using Mesos with all the settings configured (as they had
worked in the past) I get lost tasks, and when I zoom in on them, the error
being reported is below.  Basically it's a NullPointerException in
com.mapr.fs.ShimLoader.  What's weird to me is that I took each instance
and compared the two: the classpath, everything, is exactly the same.
Yet running in local mode works, and running in Mesos fails.  Also of note,
when the task is scheduled to run on the same node as when I run locally,
it fails too! (Baffling.)

Ok, for comparison, how I configured Mesos was to download the mapr4
package from spark.apache.org and use the exact same configuration file as
for 1.2.0 (except for changing the executor tgz from 1.2.0 to 1.3.1).
When I run this example with the mapr4 package for 1.2.0 there is no issue
in Mesos; everything runs as intended. Using the same package for 1.3.1,
it fails.

(Also of note, 1.2.1 gives a 404 error, 1.2.2 fails, and 1.3.0 fails as
well).

So basically, when I used 1.2.0 and followed a set of steps, it worked on
Mesos, and 1.3.1 fails.  And while this is a "current" version of Spark, MapR
supports 1.2.1 only.  (Still working on that.)

I guess I am at a loss right now on why this would be happening, any
pointers on where I could look or what I could tweak would be greatly
appreciated. Additionally, if there is something I could specifically draw
to the attention of MapR on this problem please let me know, I am perplexed
on the change from 1.2.0 to 1.3.1.

Thank you,

John




Full Error on 1.3.1 on Mesos:
15/05/19 09:31:26 INFO MemoryStore: MemoryStore started with capacity
1060.3 MB java.lang.NullPointerException at
com.mapr.fs.ShimLoader.getRootClassLoader(ShimLoader.java:96) at
com.mapr.fs.ShimLoader.injectNativeLoader(ShimLoader.java:232) at
com.mapr.fs.ShimLoader.load(ShimLoader.java:194) at
org.apache.hadoop.conf.CoreDefaultProperties.(CoreDefaultProperties.java:60)
at java.lang.Class.forName0(Native Method) at
java.lang.Class.forName(Class.java:274) at
org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:1847)
at
org.apache.hadoop.conf.Configuration.getProperties(Configuration.java:2062)
at
org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2272)
at
org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2224)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2141)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:992) at
org.apache.hadoop.conf.Configuration.set(Configuration.java:966) at
org.apache.spark.deploy.SparkHadoopUtil.newConfiguration(SparkHadoopUtil.scala:98)
at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:43) at
org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:220) at
org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala) at
org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1959) at
org.apache.spark.storage.BlockManager.(BlockManager.scala:104) at
org.apache.spark.storage.BlockManager.(BlockManager.scala:179) at
org.apache.spark.SparkEnv$.create(SparkEnv.scala:310) at
org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:186) at
org.apache.spark.executor.MesosExecutorBackend.registered(MesosExecutorBackend.scala:70)
java.lang.RuntimeException: Failure loading MapRClient. at
com.mapr.fs.ShimLoader.injectNativeLoader(ShimLoader.java:283) at
com.mapr.fs.ShimLoader.load(ShimLoader.java:194) at
org.apache.hadoop.conf.CoreDefaultProperties.(CoreDefaultProperties.java:60)
at java.lang.Class.forName0(Native Method) at
java.lang.Class.forName(Class.java:274) at
org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:1847)
at
org.apache.hadoop.conf.Configuration.getProperties(Configuration.java:2062)
at
org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2272)
at
org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2224)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2141)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:992) at
org.apache.hadoop.conf.Configuration.set(Configuration.java:966) at
org.apache.spark.deploy.SparkHadoopUtil.newConfiguration(SparkHadoopUtil.scala:98)
at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:43) at
org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala

Caching parquet table (with GZIP) on Spark 1.3.1

2015-05-26 Thread shshann


We tried to cache a table through
hiveCtx = HiveContext(sc)
hiveCtx.cacheTable("table name")
as described in the Spark 1.3.1 documentation; we're on CDH 5.3.0 with Spark
1.3.1 built with Hadoop 2.6.
The following error occurs if we try to cache a table stored in Parquet
format with GZIP, though we're not sure the error has anything to do with
the table format, since we can execute SQL against the exact same table.
We just hope to use cacheTable so that things might speed up a little,
since we query this table several times.
Any advice is welcome! Thanks!

15/05/26 15:21:32 WARN scheduler.TaskSetManager: Lost task 227.0 in stage
0.0 (TID 278, f14ecats037): parquet.io.ParquetDecodingException: Can not
read value at 0 in block -1 in file
hdfs://f14ecat/tmp/tchart_0501_final/part-r-1198.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue
(InternalParquetRecordReader.java:213)
at parquet.hadoop.ParquetRecordReader.nextKeyValue
(ParquetRecordReader.java:204)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext
(NewHadoopRDD.scala:143)
at org.apache.spark.InterruptibleIterator.hasNext
(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon
$1.hasNext(InMemoryColumnarTableScan.scala:153)
at org.apache.spark.storage.MemoryStore.unrollSafely
(MemoryStore.scala:248)
at org.apache.spark.CacheManager.putInBlockManager
(CacheManager.scala:172)
at org.apache.spark.CacheManager.getOrCompute
(CacheManager.scala:79)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.rdd.MapPartitionsRDD.compute
(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute
(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute
(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute
(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute
(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask
(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask
(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run
(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker
(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run
(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: parquet.io.ParquetDecodingException: The requested schema is not
compatible with the file schema. incompatible types: optional binary
dcqv_val (UTF8) != optional double dcqv_val
at parquet.io.ColumnIOFactory
$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:105)
at parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit
(ColumnIOFactory.java:97)
at parquet.schema.PrimitiveType.accept(PrimitiveType.java:386)
at parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren
(ColumnIOFactory.java:87)
at parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit
(ColumnIOFactory.java:61)
at parquet.schema.MessageType.accept(MessageType.java:55)
at parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:148)
at parquet.hadoop.InternalParquetRecordReader.checkRead
(InternalParquetRecordReader.java:125)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue
(InternalParquetRecordReader.java:193)
... 31 more

15/05/26 15:21:32 INFO scheduler.TaskSetManager: Starting task 74.2 in
stage 0.0 (TID 377, f14ecats025, NODE_LOCAL, 2153 bytes)
15/05/26 15:21:32 INFO scheduler.TaskSetManager: Lost task 56.2 in stage
0.0 (TID 329) on executor f14ecats025: parquet.io.ParquetDecodingException
(Can not read value at 0 in block -1 in file
hdfs://f14ecat/tmp/tchart_0501_final/part-r-1047.parquet) [duplicate 2]
15/05/26 15:21:32 INFO scheduler.TaskSetManager: Starting task 165.1 in
stage 0.0 (TID 378, f14ecats026, NODE_LOCAL, 2151 bytes)
15/05/26 15:21:32 WARN scheduler.TaskSetManager: Lost task 145.0 

Re: Spark 1.3.1 - SQL Issues

2015-05-20 Thread ayan guha
Thanks a bunch
On 21 May 2015 07:11, "Davies Liu"  wrote:

> The docs have been updated.
>
> You should convert the DataFrame to RDD by `df.rdd`
>
> On Mon, Apr 20, 2015 at 5:23 AM, ayan guha  wrote:
> > Hi
> > Just upgraded to Spark 1.3.1.
> >
> > I am getting a warning
> >
> > Warning (from warnings module):
> >   File
> >
> "D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\sql\context.py",
> > line 191
> > warnings.warn("inferSchema is deprecated, please use createDataFrame
> > instead")
> > UserWarning: inferSchema is deprecated, please use createDataFrame
> instead
> >
> > However, documentation still says to use inferSchema.
> > Here: http://spark.apache.org/docs/latest/sql-programming-guide.htm in
> > section
> >
> > Also, I am getting an error in mllib.ALS.train function when passing
> > dataframe (do I need to convert the DF to RDD?)
> >
> > Code:
> > training = ssc.sql("select userId,movieId,rating from ratings where
> > partitionKey < 6").cache()
> > print type(training)
> > model = ALS.train(training,rank,numIter,lmbda)
> >
> > Error:
> > 
> > Rank:8 Lmbda:1.0 iteration:10
> >
> > Traceback (most recent call last):
> >   File "D:\Project\Spark\code\movie_sql.py", line 109, in 
> >     bestConf =
> getBestModel(sc,ssc,training,validation,validationNoRating)
> >   File "D:\Project\Spark\code\movie_sql.py", line 54, in getBestModel
> > model = ALS.train(trainingRDD,rank,numIter,lmbda)
> >   File
> >
> "D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\mllib\recommendation.py",
> > line 139, in train
> > model = callMLlibFunc("trainALSModel", cls._prepare(ratings), rank,
> > iterations,
> >   File
> >
> "D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\mllib\recommendation.py",
> > line 127, in _prepare
> > assert isinstance(ratings, RDD), "ratings should be RDD"
> > AssertionError: ratings should be RDD
> >
> > --
> > Best Regards,
> > Ayan Guha
>


Re: Spark 1.3.1 - SQL Issues

2015-05-20 Thread Davies Liu
The docs have been updated.

You should convert the DataFrame to RDD by `df.rdd`
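
A minimal sketch of that conversion (PySpark 1.3.x, reusing the variable and
column names from the quoted message below; MLlib's ALS accepts an RDD of
(user, product, rating) tuples):

# training is the DataFrame from the question; .rdd exposes it as an RDD of Rows
ratings_rdd = training.rdd.map(lambda row: (row.userId, row.movieId, row.rating))
model = ALS.train(ratings_rdd, rank, numIter, lmbda)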

On Mon, Apr 20, 2015 at 5:23 AM, ayan guha  wrote:
> Hi
> Just upgraded to Spark 1.3.1.
>
> I am getting a warning
>
> Warning (from warnings module):
>   File
> "D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\sql\context.py",
> line 191
> warnings.warn("inferSchema is deprecated, please use createDataFrame
> instead")
> UserWarning: inferSchema is deprecated, please use createDataFrame instead
>
> However, documentation still says to use inferSchema.
> Here: http://spark.apache.org/docs/latest/sql-programming-guide.htm in
> section
>
> Also, I am getting an error in mllib.ALS.train function when passing
> dataframe (do I need to convert the DF to RDD?)
>
> Code:
> training = ssc.sql("select userId,movieId,rating from ratings where
> partitionKey < 6").cache()
> print type(training)
> model = ALS.train(training,rank,numIter,lmbda)
>
> Error:
> 
> Rank:8 Lmbda:1.0 iteration:10
>
> Traceback (most recent call last):
>   File "D:\Project\Spark\code\movie_sql.py", line 109, in 
> bestConf = getBestModel(sc,ssc,training,validation,validationNoRating)
>   File "D:\Project\Spark\code\movie_sql.py", line 54, in getBestModel
> model = ALS.train(trainingRDD,rank,numIter,lmbda)
>   File
> "D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\mllib\recommendation.py",
> line 139, in train
> model = callMLlibFunc("trainALSModel", cls._prepare(ratings), rank,
> iterations,
>   File
> "D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\mllib\recommendation.py",
> line 127, in _prepare
> assert isinstance(ratings, RDD), "ratings should be RDD"
> AssertionError: ratings should be RDD
>
> --
> Best Regards,
> Ayan Guha

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-05-20 Thread Sean Owen
I don't think any of those problems are related to Hadoop. Have you looked
at userClassPathFirst settings?
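
As a minimal sketch of that idea (not from the thread; the configuration keys
below are the Spark 1.3-era names, so verify them against your version):

import org.apache.spark.{SparkConf, SparkContext}

// Prefer the application's own jars over Spark's copies of conflicting
// dependencies, on both the driver and the executors.
val conf = new SparkConf()
  .setAppName("user-classpath-first-example")
  .set("spark.driver.userClassPathFirst", "true")
  .set("spark.executor.userClassPathFirst", "true")
val sc = new SparkContext(conf)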

On Wed, May 20, 2015 at 6:46 PM, Edward Sargisson  wrote:

> Hi Sean and Ted,
> Thanks for your replies.
>
> I don't have our current problems nicely written up as good questions yet.
> I'm still sorting out classpath issues, etc.
> In case it is of help, I'm seeing:
> * "Exception in thread "Spark Context Cleaner"
> java.lang.NoClassDefFoundError: 0
> at
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:149)"
> * We've been having clashing dependencies between a colleague and I
> because of the aforementioned classpath issue
> * The clashing dependencies are also causing issues with what jetty
> libraries are available in the classloader from Spark and don't clash with
> existing libraries we have.
>
> More anon,
>
> Cheers,
> Edward
>
>
>
>  Original Message 
>  Subject: Re: spark 1.3.1 jars in repo1.maven.org Date: 2015-05-20 00:38
> From: Sean Owen  To: Edward Sargisson <
> esa...@pobox.com> Cc: user 
>
>
> Yes, the published artifacts can only refer to one version of anything
> (OK, modulo publishing a large number of variants under classifiers).
>
> You aren't intended to rely on Spark's transitive dependencies for
> anything. Compiling against the Spark API has no relation to what
> version of Hadoop it binds against because it's not part of any API.
> You mark the Spark dependency even as "provided" in your build and get
> all the Spark/Hadoop bindings at runtime from our cluster.
>
> What problem are you experiencing?
>
>
> On Wed, May 20, 2015 at 2:17 AM, Edward Sargisson 
> wrote:
>
> Hi,
> I'd like to confirm an observation I've just made. Specifically that spark
> is only available in repo1.maven.org for one Hadoop variant.
>
> The Spark source can be compiled against a number of different Hadoops
> using
> profiles. Yay.
> However, the spark jars in repo1.maven.org appear to be compiled against
> one
> specific Hadoop and no other differentiation is made. (I can see a
> difference with hadoop-client being 2.2.0 in repo1.maven.org and 1.0.4 in
> the version I compiled locally).
>
> The implication here is that if you have a pom file asking for
> spark-core_2.10 version 1.3.1 then Maven will only give you an Hadoop 2
> version. Maven assumes that non-snapshot artifacts never change so trying
> to
> load an Hadoop 1 version will end in tears.
>
> This then means that if you compile code against spark-core then there will
> probably be classpath NoClassDefFound issues unless the Hadoop 2 version is
> exactly the one you want.
>
> Have I gotten this correct?
>
> It happens that our little app is using a Spark context directly from a
> Jetty webapp and the classpath differences were/are causing some confusion.
> We are currently installing a Hadoop 1 spark master and worker.
>
> Thanks a lot!
> Edward
>
>
>
>


Fwd: Re: spark 1.3.1 jars in repo1.maven.org

2015-05-20 Thread Edward Sargisson
Hi Sean and Ted,
Thanks for your replies.

I don't have our current problems nicely written up as good questions yet.
I'm still sorting out classpath issues, etc.
In case it is of help, I'm seeing:
* "Exception in thread "Spark Context Cleaner"
java.lang.NoClassDefFoundError: 0
at
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:149)"
* We've been having clashing dependencies between a colleague and me because
of the aforementioned classpath issue
* The clashing dependencies are also causing issues working out which Jetty
libraries are available from Spark's classloader and don't clash with
existing libraries we have.

More anon,

Cheers,
Edward



 Original Message 
 Subject: Re: spark 1.3.1 jars in repo1.maven.org Date: 2015-05-20 00:38
From: Sean Owen  To: Edward Sargisson 
Cc: user 


Yes, the published artifacts can only refer to one version of anything
(OK, modulo publishing a large number of variants under classifiers).

You aren't intended to rely on Spark's transitive dependencies for
anything. Compiling against the Spark API has no relation to what
version of Hadoop it binds against because it's not part of any API.
You mark the Spark dependency even as "provided" in your build and get
all the Spark/Hadoop bindings at runtime from our cluster.

What problem are you experiencing?


On Wed, May 20, 2015 at 2:17 AM, Edward Sargisson  wrote:

Hi,
I'd like to confirm an observation I've just made. Specifically that spark
is only available in repo1.maven.org for one Hadoop variant.

The Spark source can be compiled against a number of different Hadoops using
profiles. Yay.
However, the spark jars in repo1.maven.org appear to be compiled against one
specific Hadoop and no other differentiation is made. (I can see a
difference with hadoop-client being 2.2.0 in repo1.maven.org and 1.0.4 in
the version I compiled locally).

The implication here is that if you have a pom file asking for
spark-core_2.10 version 1.3.1 then Maven will only give you an Hadoop 2
version. Maven assumes that non-snapshot artifacts never change so trying to
load an Hadoop 1 version will end in tears.

This then means that if you compile code against spark-core then there will
probably be classpath NoClassDefFound issues unless the Hadoop 2 version is
exactly the one you want.

Have I gotten this correct?

It happens that our little app is using a Spark context directly from a
Jetty webapp and the classpath differences were/are causing some confusion.
We are currently installing a Hadoop 1 spark master and worker.

Thanks a lot!
Edward


Re: spark 1.3.1 jars in repo1.maven.org

2015-05-20 Thread Sean Owen
Yes, the published artifacts can only refer to one version of anything
(OK, modulo publishing a large number of variants under classifiers).

You aren't intended to rely on Spark's transitive dependencies for
anything. Compiling against the Spark API has no relation to what
version of Hadoop it binds against because it's not part of any API.
You mark the Spark dependency even as "provided" in your build and get
all the Spark/Hadoop bindings at runtime from our cluster.

What problem are you experiencing?
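
As an illustration of the "provided" approach (a sketch, not taken from the
thread; adjust the artifact and version to your own build):

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.3.1</version>
    <scope>provided</scope>
</dependency>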

On Wed, May 20, 2015 at 2:17 AM, Edward Sargisson  wrote:
> Hi,
> I'd like to confirm an observation I've just made. Specifically that spark
> is only available in repo1.maven.org for one Hadoop variant.
>
> The Spark source can be compiled against a number of different Hadoops using
> profiles. Yay.
> However, the spark jars in repo1.maven.org appear to be compiled against one
> specific Hadoop and no other differentiation is made. (I can see a
> difference with hadoop-client being 2.2.0 in repo1.maven.org and 1.0.4 in
> the version I compiled locally).
>
> The implication here is that if you have a pom file asking for
> spark-core_2.10 version 1.3.1 then Maven will only give you an Hadoop 2
> version. Maven assumes that non-snapshot artifacts never change so trying to
> load an Hadoop 1 version will end in tears.
>
> This then means that if you compile code against spark-core then there will
> probably be classpath NoClassDefFound issues unless the Hadoop 2 version is
> exactly the one you want.
>
> Have I gotten this correct?
>
> It happens that our little app is using a Spark context directly from a
> Jetty webapp and the classpath differences were/are causing some confusion.
> We are currently installing a Hadoop 1 spark master and worker.
>
> Thanks a lot!
> Edward

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark 1.3.1 jars in repo1.maven.org

2015-05-19 Thread Ted Yu
I think your observation is correct.
e.g.
http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10/1.3.1
shows that it depends on hadoop-client from hadoop 2.2

Cheers

On Tue, May 19, 2015 at 6:17 PM, Edward Sargisson  wrote:

> Hi,
> I'd like to confirm an observation I've just made. Specifically that spark
> is only available in repo1.maven.org for one Hadoop variant.
>
> The Spark source can be compiled against a number of different Hadoops
> using profiles. Yay.
> However, the spark jars in repo1.maven.org appear to be compiled against
> one specific Hadoop and no other differentiation is made. (I can see a
> difference with hadoop-client being 2.2.0 in repo1.maven.org and 1.0.4 in
> the version I compiled locally).
>
> The implication here is that if you have a pom file asking for
> spark-core_2.10 version 1.3.1 then Maven will only give you an Hadoop 2
> version. Maven assumes that non-snapshot artifacts never change so trying
> to load an Hadoop 1 version will end in tears.
>
> This then means that if you compile code against spark-core then there
> will probably be classpath NoClassDefFound issues unless the Hadoop 2
> version is exactly the one you want.
>
> Have I gotten this correct?
>
> It happens that our little app is using a Spark context directly from a
> Jetty webapp and the classpath differences were/are causing some confusion.
> We are currently installing a Hadoop 1 spark master and worker.
>
> Thanks a lot!
> Edward
>


spark 1.3.1 jars in repo1.maven.org

2015-05-19 Thread Edward Sargisson
Hi,
I'd like to confirm an observation I've just made. Specifically that spark
is only available in repo1.maven.org for one Hadoop variant.

The Spark source can be compiled against a number of different Hadoops
using profiles. Yay.
However, the spark jars in repo1.maven.org appear to be compiled against
one specific Hadoop and no other differentiation is made. (I can see a
difference with hadoop-client being 2.2.0 in repo1.maven.org and 1.0.4 in
the version I compiled locally).

The implication here is that if you have a pom file asking for
spark-core_2.10 version 1.3.1 then Maven will only give you an Hadoop 2
version. Maven assumes that non-snapshot artifacts never change so trying
to load an Hadoop 1 version will end in tears.

This then means that if you compile code against spark-core then there will
probably be classpath NoClassDefFound issues unless the Hadoop 2 version is
exactly the one you want.

Have I gotten this correct?

It happens that our little app is using a Spark context directly from a
Jetty webapp and the classpath differences were/are causing some confusion.
We are currently installing a Hadoop 1 spark master and worker.

Thanks a lot!
Edward


RE: Spark 1.3.1 Performance Tuning/Patterns for Data Generation Heavy/Throughput Jobs

2015-05-19 Thread Evo Eftimov
Is that a Spark or a Spark Streaming application?

 

Re the map transformation that is required: you can also try flatMap.

 

Finally, an Executor is essentially a JVM spawned by a Spark Worker Node or YARN;
giving 60GB of RAM to a single JVM will certainly result in "off the charts" GC. I
would suggest experimenting with the following two things:

 

1.   Give less RAM to each Executor but have more Executors, including more than
one Executor per Node, especially if the ratio of RAM to CPU cores is favorable

2.   Use memory-serialized RDDs (see the sketch just below): this still stores
them in RAM but in Java-serialized form, which cuts GC pressure; for storage
that is off the JVM heap entirely, and hence avoids GC, Spark can also use
Tachyon, a distributed in-memory file system, via the OFF_HEAP storage level
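
A minimal sketch of the second suggestion (Scala RDD API; `rows` stands in for
the RDD from the original post below):

import org.apache.spark.storage.StorageLevel

// Keep the RDD in memory, but as serialized bytes rather than live Java
// objects; this usually cuts GC pressure at the cost of some extra CPU.
val cached = rows.persist(StorageLevel.MEMORY_ONLY_SER)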

 

From: Night Wolf [mailto:nightwolf...@gmail.com] 
Sent: Tuesday, May 19, 2015 9:36 AM
To: user@spark.apache.org
Subject: Spark 1.3.1 Performance Tuning/Patterns for Data Generation 
Heavy/Throughput Jobs

 

Hi all,

 

I have a job that, for every row, creates about 20 new objects (i.e. RDD of 100 
rows in = RDD 2000 rows out). The reason for this is each row is tagged with a 
list of the 'buckets' or 'windows' it belongs to. 

 

The actual data is about 10 billion rows. Each executor has 60GB of memory.

 

Currently I have a mapPartitions task that is doing this object creation in a 
Scala Map and then returning the HashMap as an iterator via .toIterator. 

 

Is there a more efficient way to do this (assuming I can't use something like 
flatMap). 

 

The job runs (assuming each task size is small enough). But the GC time is 
understandably off the charts. 

 

I've reduced the spark cache memory percentage to 0.05 (as I just need space 
for a few broadcasts and this is a data churn task). I've left the shuffle 
memory percent unchanged. 

 

What kinds of settings should I be tuning with regards to GC? 

 

Looking at 
https://spark-summit.org/2014/wp-content/uploads/2015/03/SparkSummitEast2015-AdvDevOps-StudentSlides.pdf
 (slide 125 recommends some settings but I'm not sure what would be best here). 
I tried using -XX:+UseG1GC but it pretty much causes my job to fail (all the 
executors die). Are there any tips with respect to the ratio of new gen and old 
gen space when creating lots of objects which will live in a data structure 
until the entire partition is processed? 

 

Any tips for tuning these kinds of jobs would be helpful!

 

Thanks,

~N

 



Spark 1.3.1 Performance Tuning/Patterns for Data Generation Heavy/Throughput Jobs

2015-05-19 Thread Night Wolf
Hi all,

I have a job that, for every row, creates about 20 new objects (i.e. RDD of
100 rows in = RDD 2000 rows out). The reason for this is each row is tagged
with a list of the 'buckets' or 'windows' it belongs to.

The actual data is about 10 billion rows. Each executor has 60GB of memory.

Currently I have a mapPartitions task that is doing this object creation in
a Scala Map and then returning the HashMap as an iterator via .toIterator.

Is there a more efficient way to do this (assuming I can't use something
like flatMap)?
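
If RDD.flatMap itself is off the table, the same lazy expansion can still be
done inside mapPartitions over the partition iterator. A sketch (Scala; `rows`
stands in for the input RDD and `bucketsFor` for whatever computes a row's
bucket/window ids, both assumptions rather than names from this thread):

val tagged = rows.mapPartitions { iter =>
  // flatMap over the partition's iterator stays lazy: each input row is
  // expanded into its (bucket, row) records only as downstream code pulls
  // them, so the partition's full output is never held in memory at once.
  iter.flatMap { row =>
    bucketsFor(row).iterator.map(bucket => (bucket, row))
  }
}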

The job runs (assuming each task size is small enough). But the GC time is
understandably off the charts.

I've reduced the spark cache memory percentage to 0.05 (as I just need
space for a few broadcasts and this is a data churn task). I've left the
shuffle memory percent unchanged.

What kinds of settings should I be tuning with regards to GC?

Looking at
https://spark-summit.org/2014/wp-content/uploads/2015/03/SparkSummitEast2015-AdvDevOps-StudentSlides.pdf
(slide 125 recommends some settings but I'm not sure what would be best here). I
tried using -XX:+UseG1GC but it pretty much causes my job to fail (all the
executors die). Are there any tips with respect to the ratio of new gen and
old gen space when creating lots of objects which will live in a data
structure until the entire partition is processed?

Any tips for tuning these kinds of jobs would be helpful!

Thanks,
~N


Hive partition table + read using hiveContext + spark 1.3.1

2015-05-14 Thread SamyaMaiti
Hi Team,

I have a Hive partitioned table with a partition column that contains spaces.

When I try to run any query, say a simple "Select * from table_name", it
fails.

*Please note the same was working in Spark 1.2.0; now I have upgraded to
1.3.1. Also, there is no change in my application code base.*

If I give a partition column without spaces, all works fine.

Please provide your inputs.

Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Hive-partition-table-read-using-hiveContext-spark-1-3-1-tp22894.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL issue: Spark 1.3.1 + hadoop 2.6 on CDH5.3 with parquet

2015-05-07 Thread felicia
Hi all,

Thanks for the help on this case!
We finally settled this by adding a jar named parquet-hive-bundle-1.5.0.jar
when submitting jobs through spark-submit;
this jar file does not ship with our CDH 5.3 anyway (we downloaded it
from http://mvnrepository.com/artifact/com.twitter/parquet-hive/1.5.0).

Hope this helps!
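
For reference, a sketch of what that submission can look like (the jar path and
the trailing arguments are placeholders, not taken from the thread):

spark-submit --jars /path/to/parquet-hive-bundle-1.5.0.jar <your usual spark-submit arguments>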



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-issue-Spark-1-3-1-hadoop-2-6-on-CDH5-3-with-parquet-tp22808p22810.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL issue: Spark 1.3.1 + hadoop 2.6 on CDH5.3 with parquet

2015-05-07 Thread Marcelo Vanzin
On Thu, May 7, 2015 at 7:39 PM, felicia  wrote:

> we tried to add /usr/lib/parquet/lib & /usr/lib/parquet to SPARK_CLASSPATH
> and it doesn't seem to work,
>

To add the jars to the classpath you need to use "/usr/lib/parquet/lib/*",
otherwise you're just adding the directory (and not the files within it).

-- 
Marcelo


SparkSQL issue: Spark 1.3.1 + hadoop 2.6 on CDH5.3 with parquet

2015-05-07 Thread felicia
Hi all,
I'm able to run SparkSQL through Python/Java and retrieve data from an
ordinary table,
but when trying to fetch data from a Parquet table, the following error shows up,
which is pretty straightforward, indicating that a Parquet-related class was
not found.
We tried to add /usr/lib/parquet/lib and /usr/lib/parquet to SPARK_CLASSPATH
and it doesn't seem to work.
What is the proper way to solve this issue? Do we have to re-compile Spark
with the Parquet-related jars?

Thanks!

ERROR hive.log: error in initSerDe: java.lang.ClassNotFoundException Class
parquet.hive.serde.ParquetHiveSerDe not found
java.lang.ClassNotFoundException: Class parquet.hive.serde.ParquetHiveSerDe
not found
at
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
at
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:337)
at
org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288)
at 
org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:281)
at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:631)
at 
org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:189)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1017)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
at
org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:78)
at
org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:253)
at
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
at
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
at scala.Option.getOrElse(Option.scala:120)
at
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137)
at
org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:253)
at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:143)
at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:138)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:191)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:147)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:191)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:147)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
at
org.apache.sp

Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread in4maniac
Hi V, 

I am assuming that each of the three .parquet paths you mentioned have
multiple partitions in them. 

For eg: [/dataset/city=London/data.parquet/part-r-0.parquet,
/dataset/city=London/data.parquet/part-r-1.parquet]

I haven't personally used this with "hdfs", but I've worked with a similar
file strucutre with '=' in "S3". 

And how i get around this is by building a string of all the filepaths
seperated by commas (with NO spaces inbetween). Then I use that string as
the filepath parameter. I think the following adaptation of S3 file access
pattern to HDFS would work

If I want to load 1 file:
sqlcontext.parquetFile( "hdfs://some
ip:8029/dataset/city=London/data.parquet")

If I want to load multiple files (lets say all 3 of them):
sqlcontext.parquetFile( "hdfs://some
ip:8029/dataset/city=London/data.parquet,hdfs://some
ip:8029/dataset/city=NewYork/data.parquet,hdfs://some
ip:8029/dataset/city=Paris/data.parquet")

*** But in the multiple file scenario, the schema of all the files should be
the same

I hope you can use this S3 pattern with HDFS and hope it works !!

Thanks
in4



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-1-and-Parquet-Partitions-tp22792p22801.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread Yana Kadiyska
Here is the JIRA:  https://issues.apache.org/jira/browse/SPARK-3928
Looks like for now you'd have to list the full paths... I don't see a
comment from an official Spark committer, so I'm still not sure whether this
is a bug or by design, but it seems to be the current state of affairs.

On Thu, May 7, 2015 at 8:43 AM, yana  wrote:

> I believe this is a regression. Does not work for me either. There is a
> Jira on parquet wildcards which is resolved, I'll see about getting it
> reopened
>
>
> Sent on the new Sprint Network from my Samsung Galaxy S®4.
>
>
>  Original message 
> From: Vaxuki
> Date:05/07/2015 7:38 AM (GMT-05:00)
> To: Olivier Girardot
> Cc: user@spark.apache.org
> Subject: Re: Spark 1.3.1 and Parquet Partitions
>
> Olivier
> Nope. Wildcard extensions don't work I am debugging the code to figure out
> what's wrong I know I am using 1.3.1 for sure
>
> Pardon typos...
>
> On May 7, 2015, at 7:06 AM, Olivier Girardot  wrote:
>
> "hdfs://some ip:8029/dataset/*/*.parquet" doesn't work for you ?
>
> Le jeu. 7 mai 2015 à 03:32, vasuki  a écrit :
>
>> Spark 1.3.1 -
>> i have a parquet file on hdfs partitioned by some string looking like this
>> /dataset/city=London/data.parquet
>> /dataset/city=NewYork/data.parquet
>> /dataset/city=Paris/data.paruqet
>> ….
>>
>> I am trying to get to load it using sqlContext using
>> sqlcontext.parquetFile(
>> "hdfs://some ip:8029/dataset/< what do i put here >
>>
>> No leads so far. is there i can load the partitions ? I am running on
>> cluster and not local..
>> -V
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-1-and-Parquet-Partitions-tp22792.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread yana
I believe this is a regression. Does not work for me either. There is a Jira on 
parquet wildcards which is resolved, I'll see about getting it reopened


Sent on the new Sprint Network from my Samsung Galaxy S®4.

 Original message From: Vaxuki 
 Date:05/07/2015  7:38 AM  (GMT-05:00) 
To: Olivier Girardot  Cc: 
user@spark.apache.org Subject: Re: Spark 1.3.1 and Parquet 
Partitions 
Olivier 
Nope. Wildcard extensions don't work I am debugging the code to figure out 
what's wrong I know I am using 1.3.1 for sure

Pardon typos...

On May 7, 2015, at 7:06 AM, Olivier Girardot  wrote:

"hdfs://some ip:8029/dataset/*/*.parquet" doesn't work for you ?

Le jeu. 7 mai 2015 à 03:32, vasuki  a écrit :
Spark 1.3.1 -
i have a parquet file on hdfs partitioned by some string looking like this
/dataset/city=London/data.parquet
/dataset/city=NewYork/data.parquet
/dataset/city=Paris/data.paruqet
….

I am trying to get to load it using sqlContext using sqlcontext.parquetFile(
"hdfs://some ip:8029/dataset/< what do i put here >

No leads so far. is there i can load the partitions ? I am running on
cluster and not local..
-V



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-1-and-Parquet-Partitions-tp22792.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread Vaxuki
Olivier 
Nope. Wildcard extensions don't work. I am debugging the code to figure out
what's wrong. I know I am using 1.3.1 for sure.

Pardon typos...

> On May 7, 2015, at 7:06 AM, Olivier Girardot  wrote:
> 
> "hdfs://some ip:8029/dataset/*/*.parquet" doesn't work for you ?
> 
>> Le jeu. 7 mai 2015 à 03:32, vasuki  a écrit :
>> Spark 1.3.1 -
>> i have a parquet file on hdfs partitioned by some string looking like this
>> /dataset/city=London/data.parquet
>> /dataset/city=NewYork/data.parquet
>> /dataset/city=Paris/data.paruqet
>> ….
>> 
>> I am trying to get to load it using sqlContext using sqlcontext.parquetFile(
>> "hdfs://some ip:8029/dataset/< what do i put here >
>> 
>> No leads so far. is there i can load the partitions ? I am running on
>> cluster and not local..
>> -V
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-1-and-Parquet-Partitions-tp22792.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org


Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread Olivier Girardot
"hdfs://some ip:8029/dataset/*/*.parquet" doesn't work for you ?

Le jeu. 7 mai 2015 à 03:32, vasuki  a écrit :

> Spark 1.3.1 -
> i have a parquet file on hdfs partitioned by some string looking like this
> /dataset/city=London/data.parquet
> /dataset/city=NewYork/data.parquet
> /dataset/city=Paris/data.paruqet
> ….
>
> I am trying to get to load it using sqlContext using
> sqlcontext.parquetFile(
> "hdfs://some ip:8029/dataset/< what do i put here >
>
> No leads so far. is there i can load the partitions ? I am running on
> cluster and not local..
> -V
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-1-and-Parquet-Partitions-tp22792.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Spark 1.3.1 and Parquet Partitions

2015-05-06 Thread vasuki
Spark 1.3.1 - 
I have a parquet file on HDFS partitioned by some string, looking like this:
/dataset/city=London/data.parquet
/dataset/city=NewYork/data.parquet
/dataset/city=Paris/data.parquet
….

I am trying to load it using sqlContext, using sqlcontext.parquetFile(
"hdfs://some ip:8029/dataset/< what do i put here > 

No leads so far. Is there a way I can load the partitions? I am running on
a cluster and not locally.
-V



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-1-and-Parquet-Partitions-tp22792.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark 1.3.1

2015-05-04 Thread Deng Ching-Mallete
Hi,

I think you need to import "org.apache.spark.sql.types.DataTypes" instead
of "org.apache.spark.sql.types.DataType" and use that to access
StringType.
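
A sketch of the corrected schema-building part (Java, Spark 1.3.x), assuming
the rest of the example from the quoted message below stays the same:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// DataTypes (plural) is the factory class that exposes StringType and the
// createStructField/createStructType helpers in the Spark 1.3 Java API.
List<StructField> fields = new ArrayList<StructField>();
for (String fieldName : schemaString.split(" ")) {
  fields.add(DataTypes.createStructField(fieldName, DataTypes.StringType, true));
}
StructType schema = DataTypes.createStructType(fields);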

HTH,
Deng

On Mon, May 4, 2015 at 9:37 PM, Saurabh Gupta 
wrote:

> I am really new to this, but what should I look for in the Maven logs?
>
> I have tried mvn package -X -e
>
> Should I show the full trace?
>
>
>
> On Mon, May 4, 2015 at 6:54 PM, Driesprong, Fokko 
> wrote:
>
>> Hi Saurabh,
>>
>> Did you check the log of maven?
>>
>> 2015-05-04 15:17 GMT+02:00 Saurabh Gupta :
>>
>>> HI,
>>>
>>> I am trying to build the example code given at
>>>
>>> https://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds
>>>
>>> code is:
>>>
>>> // Import factory methods provided by DataType.
>>> import org.apache.spark.sql.types.DataType;
>>> // Import StructType and StructField
>>> import org.apache.spark.sql.types.StructType;
>>> import org.apache.spark.sql.types.StructField;
>>> // Import Row.
>>> import org.apache.spark.sql.Row;
>>>
>>> // sc is an existing JavaSparkContext.
>>> SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
>>>
>>> // Load a text file and convert each line to a JavaBean.
>>> JavaRDD<String> people = sc.textFile("examples/src/main/resources/people.txt");
>>>
>>> // The schema is encoded in a string
>>> String schemaString = "name age";
>>>
>>> // Generate the schema based on the string of schema
>>> List<StructField> fields = new ArrayList<StructField>();
>>> for (String fieldName: schemaString.split(" ")) {
>>>   fields.add(DataType.createStructField(fieldName, DataType.StringType, true));
>>> }
>>> StructType schema = DataType.createStructType(fields);
>>>
>>> // Convert records of the RDD (people) to Rows.
>>> JavaRDD<Row> rowRDD = people.map(
>>>   new Function<String, Row>() {
>>>     public Row call(String record) throws Exception {
>>>       String[] fields = record.split(",");
>>>       return Row.create(fields[0], fields[1].trim());
>>>     }
>>>   });
>>>
>>> // Apply the schema to the RDD.
>>> DataFrame peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema);
>>>
>>> // Register the DataFrame as a table.
>>> peopleDataFrame.registerTempTable("people");
>>>
>>> // SQL can be run over RDDs that have been registered as tables.
>>> DataFrame results = sqlContext.sql("SELECT name FROM people");
>>>
>>> // The results of SQL queries are DataFrames and support all the normal RDD operations.
>>> // The columns of a row in the result can be accessed by ordinal.
>>> List<String> names = results.map(new Function<Row, String>() {
>>>   public String call(Row row) {
>>>     return "Name: " + row.getString(0);
>>>   }
>>> }).collect();
>>>
>>> my pom file looks like:
>>>
>>> 
>>> 
>>> org.apache.spark
>>> spark-core_2.10
>>> 1.3.1
>>> 
>>> 
>>> org.apache.spark
>>> spark-sql_2.10
>>> 1.3.1
>>> 
>>> 
>>> org.apache.hbase
>>> hbase
>>> 0.94.0
>>> 
>>>
>>> When I try to mvn package I am getting this issue:
>>> cannot find symbol
>>> [ERROR] symbol:   variable StringType
>>> [ERROR] location: class org.apache.spark.sql.types.DataType
>>>
>>> I have gone through
>>> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/StringType.html
>>>
>>> What is missing here?
>>>
>>>


Re: spark 1.3.1

2015-05-04 Thread Saurabh Gupta
I am really new to this, but what should I look for in the Maven logs?

I have tried mvn package -X -e.

Should I show the full trace?



On Mon, May 4, 2015 at 6:54 PM, Driesprong, Fokko 
wrote:

> Hi Saurabh,
>
> Did you check the log of maven?
>
> 2015-05-04 15:17 GMT+02:00 Saurabh Gupta :
>
>> HI,
>>
>> I am trying to build a example code given at
>>
>> https://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds
>>
>> code is:
>>
>> // Import factory methods provided by DataType.import 
>> org.apache.spark.sql.types.DataType;// Import StructType and 
>> StructFieldimport org.apache.spark.sql.types.StructType;import 
>> org.apache.spark.sql.types.StructField;// Import Row.import 
>> org.apache.spark.sql.Row;
>> // sc is an existing JavaSparkContext.SQLContext sqlContext = new 
>> org.apache.spark.sql.SQLContext(sc);
>> // Load a text file and convert each line to a JavaBean.JavaRDD 
>> people = sc.textFile("examples/src/main/resources/people.txt");
>> // The schema is encoded in a stringString schemaString = "name age";
>> // Generate the schema based on the string of schemaList fields 
>> = new ArrayList();for (String fieldName: schemaString.split(" 
>> ")) {
>>   fields.add(DataType.createStructField(fieldName, DataType.StringType, 
>> true));}StructType schema = DataType.createStructType(fields);
>> // Convert records of the RDD (people) to Rows.JavaRDD rowRDD = 
>> people.map(
>>   new Function() {
>> public Row call(String record) throws Exception {
>>   String[] fields = record.split(",");
>>   return Row.create(fields[0], fields[1].trim());
>> }
>>   });
>> // Apply the schema to the RDD.DataFrame peopleDataFrame = 
>> sqlContext.createDataFrame(rowRDD, schema);
>> // Register the DataFrame as a 
>> table.peopleDataFrame.registerTempTable("people");
>> // SQL can be run over RDDs that have been registered as tables.DataFrame 
>> results = sqlContext.sql("SELECT name FROM people");
>> // The results of SQL queries are DataFrames and support all the normal RDD 
>> operations.// The columns of a row in the result can be accessed by 
>> ordinal.List names = results.map(new Function() {
>>   public String call(Row row) {
>> return "Name: " + row.getString(0);
>>   }}).collect();
>>
>> my pom file looks like:
>>
>> 
>> 
>> org.apache.spark
>> spark-core_2.10
>> 1.3.1
>> 
>> 
>> org.apache.spark
>> spark-sql_2.10
>> 1.3.1
>> 
>> 
>> org.apache.hbase
>> hbase
>> 0.94.0
>> 
>>
>> When I try to mvn package I am getting this issue:
>> cannot find symbol
>> [ERROR] symbol:   variable StringType
>> [ERROR] location: class org.apache.spark.sql.types.DataType
>>
>> I have gone through
>> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/StringType.html
>>
>> What is missing here?
>>
>>
>


Re: spark 1.3.1

2015-05-04 Thread Driesprong, Fokko
Hi Saurabh,

Did you check the Maven log?

2015-05-04 15:17 GMT+02:00 Saurabh Gupta :

> HI,
>
> I am trying to build a example code given at
>
> https://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds
>
> code is:
>
> // Import factory methods provided by DataType.import 
> org.apache.spark.sql.types.DataType;// Import StructType and 
> StructFieldimport org.apache.spark.sql.types.StructType;import 
> org.apache.spark.sql.types.StructField;// Import Row.import 
> org.apache.spark.sql.Row;
> // sc is an existing JavaSparkContext.SQLContext sqlContext = new 
> org.apache.spark.sql.SQLContext(sc);
> // Load a text file and convert each line to a JavaBean.JavaRDD 
> people = sc.textFile("examples/src/main/resources/people.txt");
> // The schema is encoded in a stringString schemaString = "name age";
> // Generate the schema based on the string of schemaList fields 
> = new ArrayList();for (String fieldName: schemaString.split(" 
> ")) {
>   fields.add(DataType.createStructField(fieldName, DataType.StringType, 
> true));}StructType schema = DataType.createStructType(fields);
> // Convert records of the RDD (people) to Rows.JavaRDD rowRDD = 
> people.map(
>   new Function() {
> public Row call(String record) throws Exception {
>   String[] fields = record.split(",");
>   return Row.create(fields[0], fields[1].trim());
> }
>   });
> // Apply the schema to the RDD.DataFrame peopleDataFrame = 
> sqlContext.createDataFrame(rowRDD, schema);
> // Register the DataFrame as a 
> table.peopleDataFrame.registerTempTable("people");
> // SQL can be run over RDDs that have been registered as tables.DataFrame 
> results = sqlContext.sql("SELECT name FROM people");
> // The results of SQL queries are DataFrames and support all the normal RDD 
> operations.// The columns of a row in the result can be accessed by 
> ordinal.List names = results.map(new Function() {
>   public String call(Row row) {
> return "Name: " + row.getString(0);
>   }}).collect();
>
> my pom file looks like:
>
> 
> 
> org.apache.spark
> spark-core_2.10
> 1.3.1
> 
> 
> org.apache.spark
> spark-sql_2.10
> 1.3.1
> 
> 
> org.apache.hbase
> hbase
> 0.94.0
> 
>
> When I try to mvn package I am getting this issue:
> cannot find symbol
> [ERROR] symbol:   variable StringType
> [ERROR] location: class org.apache.spark.sql.types.DataType
>
> I have gone through
> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/StringType.html
>
> What is missing here?
>
>


spark 1.3.1

2015-05-04 Thread Saurabh Gupta
Hi,

I am trying to build the example code given at
https://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds

code is:

// Import factory methods provided by DataType.
import org.apache.spark.sql.types.DataType;
// Import StructType and StructField
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.types.StructField;
// Import Row.
import org.apache.spark.sql.Row;

// sc is an existing JavaSparkContext.
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

// Load a text file and convert each line to a JavaBean.
JavaRDD<String> people = sc.textFile("examples/src/main/resources/people.txt");

// The schema is encoded in a string
String schemaString = "name age";

// Generate the schema based on the string of schema
List<StructField> fields = new ArrayList<StructField>();
for (String fieldName : schemaString.split(" ")) {
  fields.add(DataType.createStructField(fieldName, DataType.StringType, true));
}
StructType schema = DataType.createStructType(fields);

// Convert records of the RDD (people) to Rows.
JavaRDD<Row> rowRDD = people.map(
  new Function<String, Row>() {
    public Row call(String record) throws Exception {
      String[] fields = record.split(",");
      return Row.create(fields[0], fields[1].trim());
    }
  });

// Apply the schema to the RDD.
DataFrame peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema);

// Register the DataFrame as a table.
peopleDataFrame.registerTempTable("people");

// SQL can be run over RDDs that have been registered as tables.
DataFrame results = sqlContext.sql("SELECT name FROM people");

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
List<String> names = results.map(new Function<Row, String>() {
  public String call(Row row) {
    return "Name: " + row.getString(0);
  }
}).collect();

my pom file looks like:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.3.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>1.3.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase</artifactId>
    <version>0.94.0</version>
</dependency>

When I try to run mvn package I get this error:
cannot find symbol
[ERROR] symbol:   variable StringType
[ERROR] location: class org.apache.spark.sql.types.DataType

I have gone through
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/StringType.html

What is missing here?


Re: casting timestamp into long fail in Spark 1.3.1

2015-04-30 Thread Michael Armbrust
This looks like a bug.  Mind opening a JIRA?

On Thu, Apr 30, 2015 at 3:49 PM, Justin Yip  wrote:

> After some trial and error, using DataType solves the problem:
>
> df.withColumn("millis", $"eventTime".cast(
> org.apache.spark.sql.types.LongType) * 1000)
>
> Justin
>
> On Thu, Apr 30, 2015 at 3:41 PM, Justin Yip 
> wrote:
>
>> Hello,
>>
>> I was able to cast a timestamp into long using
>> df.withColumn("millis", $"eventTime".cast("long") * 1000)
>> in spark 1.3.0.
>>
>> However, this statement returns a failure with spark 1.3.1. I got the
>> following exception:
>>
>> Exception in thread "main" org.apache.spark.sql.types.DataTypeException:
>> Unsupported dataType: long. If you have a struct and a field name of it has
>> any special characters, please use backticks (`) to quote that field name,
>> e.g. `x+y`. Please note that backtick itself is not supported in a field
>> name.
>>
>> at
>> org.apache.spark.sql.types.DataTypeParser$class.toDataType(DataTypeParser.scala:95)
>>
>> at
>> org.apache.spark.sql.types.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:107)
>>
>> at
>> org.apache.spark.sql.types.DataTypeParser$.apply(DataTypeParser.scala:111)
>>
>> at org.apache.spark.sql.Column.cast(Column.scala:636)
>>
>> Is there any change in the casting logic which may lead to this failure?
>>
>> Thanks.
>>
>> Justin
>>
>> --
>> View this message in context: casting timestamp into long fail in Spark
>> 1.3.1
>> <http://apache-spark-user-list.1001560.n3.nabble.com/casting-timestamp-into-long-fail-in-Spark-1-3-1-tp22727.html>
>> Sent from the Apache Spark User List mailing list archive
>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>>
>
>


Re: casting timestamp into long fail in Spark 1.3.1

2015-04-30 Thread Justin Yip
After some trial and error, using DataType solves the problem:

df.withColumn("millis", $"eventTime".cast(
org.apache.spark.sql.types.LongType) * 1000)

Justin
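
The snippet above is Scala; a rough Java equivalent of the same workaround, passing
DataTypes.LongType instead of the string "long" that the 1.3.1 parser rejects. The
helper class and the "eventTime" column name are assumed from the thread.

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.types.DataTypes;

public class CastWorkaround {
  // Casts the "eventTime" timestamp column to a long (seconds since the epoch)
  // and scales it to milliseconds, mirroring the Scala one-liner above.
  public static DataFrame withMillis(DataFrame df) {
    return df.withColumn("millis",
        df.col("eventTime").cast(DataTypes.LongType).multiply(1000));
  }
}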

On Thu, Apr 30, 2015 at 3:41 PM, Justin Yip  wrote:

> Hello,
>
> I was able to cast a timestamp into long using
> df.withColumn("millis", $"eventTime".cast("long") * 1000)
> in spark 1.3.0.
>
> However, this statement returns a failure with spark 1.3.1. I got the
> following exception:
>
> Exception in thread "main" org.apache.spark.sql.types.DataTypeException:
> Unsupported dataType: long. If you have a struct and a field name of it has
> any special characters, please use backticks (`) to quote that field name,
> e.g. `x+y`. Please note that backtick itself is not supported in a field
> name.
>
> at
> org.apache.spark.sql.types.DataTypeParser$class.toDataType(DataTypeParser.scala:95)
>
> at
> org.apache.spark.sql.types.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:107)
>
> at
> org.apache.spark.sql.types.DataTypeParser$.apply(DataTypeParser.scala:111)
>
> at org.apache.spark.sql.Column.cast(Column.scala:636)
>
> Is there any change in the casting logic which may lead to this failure?
>
> Thanks.
>
> Justin
>
> --
> View this message in context: casting timestamp into long fail in Spark
> 1.3.1
> <http://apache-spark-user-list.1001560.n3.nabble.com/casting-timestamp-into-long-fail-in-Spark-1-3-1-tp22727.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>


casting timestamp into long fail in Spark 1.3.1

2015-04-30 Thread Justin Yip
Hello,

I was able to cast a timestamp into long using
df.withColumn("millis", $"eventTime".cast("long") * 1000)
in spark 1.3.0.

However, this statement returns a failure with spark 1.3.1. I got the
following exception:

Exception in thread "main" org.apache.spark.sql.types.DataTypeException:
Unsupported dataType: long. If you have a struct and a field name of it has
any special characters, please use backticks (`) to quote that field name,
e.g. `x+y`. Please note that backtick itself is not supported in a field
name.

at
org.apache.spark.sql.types.DataTypeParser$class.toDataType(DataTypeParser.scala:95)

at
org.apache.spark.sql.types.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:107)

at
org.apache.spark.sql.types.DataTypeParser$.apply(DataTypeParser.scala:111)

at org.apache.spark.sql.Column.cast(Column.scala:636)

Is there any change in the casting logic which may lead to this failure?

Thanks.

Justin




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/casting-timestamp-into-long-fail-in-Spark-1-3-1-tp22727.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark 1.3.1 Hadoop 2.4 Prebuilt package broken ?

2015-04-28 Thread ๏̯͡๏
Worked now.




On Mon, Apr 27, 2015 at 10:20 PM, Sean Owen  wrote:

> Works fine for me. Make sure you're not downloading the HTML
> redirector page and thinking it's the archive.
>
> On Mon, Apr 27, 2015 at 11:43 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) 
> wrote:
> > I downloaded 1.3.1 hadoop 2.4 prebuilt package (tar) from multiple
> mirrors
> > and direct link. Each time i untar i get below error
> >
> > spark-1.3.1-bin-hadoop2.4/lib/spark-assembly-1.3.1-hadoop2.4.0.jar:
> (Empty
> > error message)
> >
> > tar: Error exit delayed from previous errors
> >
> >
> > Is it broken ?
> >
> >
> > --
> > Deepak
> >
>



-- 
Deepak


Re: Spark 1.3.1 JavaStreamingContext - fileStream compile error

2015-04-27 Thread Akhil Das
How about:

JavaPairDStream<LongWritable, Text> input =
  jssc.fileStream(inputDirectory, LongWritable.class, Text.class,
TextInputFormat.class);

See the complete example over here


Thanks
Best Regards
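
For the six-argument overload that the compile error refers to, the filter has to be a
Function<Path, Boolean> over org.apache.hadoop.fs.Path. A minimal sketch, assuming the
Spark 1.3 signature that takes the filter plus a newFilesOnly flag; the local[2] master,
input directory and filter logic are placeholders.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class FileStreamExample {
  public static void main(String[] args) throws Exception {
    JavaStreamingContext jssc =
        new JavaStreamingContext("local[2]", "file-stream-example", Durations.seconds(10));

    // The filter argument must be a Function<Path, Boolean>; this one skips hidden files.
    Function<Path, Boolean> filter = new Function<Path, Boolean>() {
      public Boolean call(Path path) {
        return !path.getName().startsWith(".");
      }
    };

    JavaPairDStream<LongWritable, Text> input = jssc.fileStream(
        "hdfs:///some/input/dir",                         // placeholder directory
        LongWritable.class, Text.class, TextInputFormat.class,
        filter, true);                                    // newFilesOnly = true

    input.print();
    jssc.start();
    jssc.awaitTermination();
  }
}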

On Tue, Apr 28, 2015 at 11:32 AM, lokeshkumar  wrote:

>
> Hi Forum
>
> I am facing below compile error when using the fileStream method of the
> JavaStreamingContext class.
> I have copied the code from JavaAPISuite.java test class of spark test
> code.
>
> The error message is
>
>
> 
> The method fileStream(String, Class, Class, Class,
> Function, boolean) in the type JavaStreamingContext is not
> applicable for the arguments (String, Class, Class,
> Class, new Function(){}, boolean)
>
> 
>
> Please help me to find a solution for this.
>
> 
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-1-JavaStreamingContext-fileStream-compile-error-tp22683.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Spark 1.3.1 JavaStreamingContext - fileStream compile error

2015-04-27 Thread lokeshkumar

Hi Forum 

I am facing the compile error below when using the fileStream method of the
JavaStreamingContext class.
I have copied the code from the JavaAPISuite.java test class of the Spark test code.

The error message is 


 
The method fileStream(String, Class, Class, Class,
Function, boolean) in the type JavaStreamingContext is not
applicable for the arguments (String, Class, Class,
Class, new Function(){}, boolean) 

 

Please help me to find a solution for this. 

 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-1-JavaStreamingContext-fileStream-compile-error-tp22683.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark 1.3.1 JavaStreamingContext - fileStream compile error

2015-04-27 Thread lokeshkumar
Hi Forum

I am facing the compile error below when using the fileStream method of the
JavaStreamingContext class.
I have copied the code from the JavaAPISuite.java test class of the Spark test code.

Please help me to find a solution for this.

 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-1-JavaStreamingContext-fileStream-compile-error-tp22679.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.3.1 Hadoop 2.4 Prebuilt package broken ?

2015-04-27 Thread Sean Owen
Works fine for me. Make sure you're not downloading the HTML
redirector page and thinking it's the archive.

On Mon, Apr 27, 2015 at 11:43 AM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:
> I downloaded 1.3.1 hadoop 2.4 prebuilt package (tar) from multiple mirrors
> and direct link. Each time i untar i get below error
>
> spark-1.3.1-bin-hadoop2.4/lib/spark-assembly-1.3.1-hadoop2.4.0.jar: (Empty
> error message)
>
> tar: Error exit delayed from previous errors
>
>
> Is it broken ?
>
>
> --
> Deepak
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Spark 1.3.1 Hadoop 2.4 Prebuilt package broken ?

2015-04-27 Thread Ganelin, Ilya
What command are you using to untar? Are you running out of disk space?



Sent with Good (www.good.com)


-Original Message-
From: ÐΞ€ρ@Ҝ (๏̯͡๏) [deepuj...@gmail.com<mailto:deepuj...@gmail.com>]
Sent: Monday, April 27, 2015 11:44 AM Eastern Standard Time
To: user
Subject: Spark 1.3.1 Hadoop 2.4 Prebuilt package broken ?

I downloaded 1.3.1 hadoop 2.4 prebuilt package (tar) from multiple mirrors and 
direct link. Each time i untar i get below error


spark-1.3.1-bin-hadoop2.4/lib/spark-assembly-1.3.1-hadoop2.4.0.jar: (Empty 
error message)

tar: Error exit delayed from previous errors


Is it broken ?

--
Deepak





Spark 1.3.1 Hadoop 2.4 Prebuilt package broken ?

2015-04-27 Thread ๏̯͡๏
I downloaded the 1.3.1 Hadoop 2.4 prebuilt package (tar) from multiple mirrors
and a direct link. Each time I untar it I get the error below

spark-1.3.1-bin-hadoop2.4/lib/spark-assembly-1.3.1-hadoop2.4.0.jar: (Empty
error message)

tar: Error exit delayed from previous errors


Is it broken ?

-- 
Deepak


Re: Re: Is the Spark-1.3.1 support build with scala 2.8 ?

2015-04-23 Thread guoqing0...@yahoo.com.hk
Thank you very much for your suggestion.

Regards,
 
From: madhu phatak
Date: 2015-04-24 13:06
To: guoqing0...@yahoo.com.hk
CC: user
Subject: Re: Is the Spark-1.3.1 support build with scala 2.8 ?
Hi,
AFAIK it's only build with 2.10 and 2.11.  You should integrate 
kafka_2.10.0-0.8.0 to make it work.




Regards,
Madhukara Phatak
http://datamantra.io/

On Fri, Apr 24, 2015 at 9:22 AM, guoqing0...@yahoo.com.hk 
 wrote:
Is the Spark-1.3.1 support build with scala 2.8 ?  Wether it can integrated 
with kafka_2.8.0-0.8.0 If build with scala 2.10 . 

Thanks.



Re: Is the Spark-1.3.1 support build with scala 2.8 ?

2015-04-23 Thread madhu phatak
Hi,
AFAIK it's only built with Scala 2.10 and 2.11. You should integrate
kafka_2.10.0-0.8.0
to make it work.




Regards,
Madhukara Phatak
http://datamantra.io/

On Fri, Apr 24, 2015 at 9:22 AM, guoqing0...@yahoo.com.hk <
guoqing0...@yahoo.com.hk> wrote:

> Is the Spark-1.3.1 support build with scala 2.8 ?  Wether it can
> integrated with kafka_2.8.0-0.8.0 If build with scala 2.10 .
>
> Thanks.
>


Is the Spark-1.3.1 support build with scala 2.8 ?

2015-04-23 Thread guoqing0...@yahoo.com.hk
Does Spark 1.3.1 support being built with Scala 2.8? And can it be integrated
with kafka_2.8.0-0.8.0 if it is built with Scala 2.10?

Thanks.


  1   2   >