Re: Save DataFrame to Hive Table

2016-02-29 Thread Jeff Zhang
The following line does not execute the SQL, so the table is not created.
Add .show() at the end to execute the SQL.

hiveContext.sql("CREATE TABLE IF NOT EXISTS TableName (key INT, value
STRING)")

On Tue, Mar 1, 2016 at 2:22 PM, Yogesh Vyas  wrote:

> Hi,
>
> I have created a DataFrame in Spark, now I want to save it directly
> into the hive table. How to do it.?
>
> I have created the hive table using following hiveContext:
>
> HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(sc.sc
> ());
> hiveContext.sql("CREATE TABLE IF NOT EXISTS TableName (key
> INT, value STRING)");
>
> I am using the following to save it into hive:
> DataFrame.write().mode(SaveMode.Append).insertInto("TableName");
>
> But it gives the error:
> Exception in thread "main" java.lang.RuntimeException: Table Not
> Found: TableName
> at scala.sys.package$.error(package.scala:27)
> at
> org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:139)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:257)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:266)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:264)
> at
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:264)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:254)
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:916)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:916)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:918)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:917)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:921)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:921)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:926)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:924)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:930)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:930)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
> at
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:176)
> at
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:164)
> at com.honeywell.Track.combine.App.main(App.java:451)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at
> org.apache.spark.deploy.SparkSubmit$.submit(Sp

Save DataFrame to Hive Table

2016-02-29 Thread Yogesh Vyas
Hi,

I have created a DataFrame in Spark, and now I want to save it directly
into a Hive table. How do I do it?

I have created the Hive table using the following HiveContext:

HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(sc.sc());
hiveContext.sql("CREATE TABLE IF NOT EXISTS TableName (key
INT, value STRING)");

I am using the following to save it into hive:
DataFrame.write().mode(SaveMode.Append).insertInto("TableName");

But it gives the error:
Exception in thread "main" java.lang.RuntimeException: Table Not
Found: TableName
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:139)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:257)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:266)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:264)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:264)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:254)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
at 
scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:916)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:916)
at 
org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:918)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:917)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:921)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:921)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:926)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:924)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:930)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:930)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
at 
org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:176)
at 
org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:164)
at com.honeywell.Track.combine.App.main(App.java:451)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)




Spark Streaming: java.lang.StackOverflowError

2016-02-29 Thread Vinti Maheshwari
Hi All,

I am getting the error below in my Spark Streaming application; I am using Kafka
for the input stream. When I was using a socket stream it was working fine, but
when I changed to Kafka it started giving this error. Does anyone have an idea why it's
throwing this error? Do I need to change my batch time and checkpointing time?



*ERROR StreamingContext: Error starting the context, marking it as
stoppedjava.lang.StackOverflowError*

My program:

def main(args: Array[String]): Unit = {

  // Function to create and setup a new StreamingContext
  def functionToCreateContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("HBaseStream")
    val sc = new SparkContext(conf)
    // create a StreamingContext, the main entry point for all streaming functionality
    val ssc = new StreamingContext(sc, Seconds(5))
    val brokers = args(0)
    val topics = args(1)
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    val inputStream = messages.map(_._2)
    // val inputStream = ssc.socketTextStream(args(0), args(1).toInt)
    ssc.checkpoint(checkpointDirectory)
    inputStream.print(1)
    val parsedStream = inputStream
      .map(line => {
        val splitLines = line.split(",")
        (splitLines(1), splitLines.slice(2, splitLines.length).map(_.trim.toLong))
      })
    import breeze.linalg.{DenseVector => BDV}
    import scala.util.Try

    val state: DStream[(String, Array[Long])] = parsedStream.updateStateByKey(
      (current: Seq[Array[Long]], prev: Option[Array[Long]]) => {
        prev.map(_ +: current).orElse(Some(current))
          .flatMap(as => Try(as.map(BDV(_)).reduce(_ + _).toArray).toOption)
      })

    state.checkpoint(Duration(1))
    state.foreachRDD(rdd => rdd.foreach(Blaher.blah))
    ssc
  }

  // Get StreamingContext from checkpoint data or create a new one
  val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
}
} // closes the enclosing object/class, not shown in this snippet

Regards,
~Vinti


Re: [ERROR]: Spark 1.5.2 + Hbase 1.1 + Hive 1.2 + HbaseIntegration

2016-02-29 Thread Ted Yu
16/02/29 23:09:34 INFO ZooKeeper: Initiating client connection,
connectString=localhost:2181 sessionTimeout=9
watcher=hconnection-0x26fa89a20x0, quorum=localhost:2181, baseZNode=/hbase

Since the baseZNode didn't match what you set in hbase-site.xml, the likely
cause is that hbase-site.xml was inaccessible to your Spark job.

Please add it to your classpath.

On Mon, Feb 29, 2016 at 8:42 PM, Ted Yu  wrote:

> 16/02/29 23:09:34 INFO ClientCnxn: Opening socket connection to server
> localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using
> SASL (unknown error)
>
> Is your cluster secure cluster ?
>
> bq. Trace :
>
> Was there any output after 'Trace :' ?
>
> Was hbase-site.xml accessible to your Spark job ?
>
> Thanks
>
> On Mon, Feb 29, 2016 at 8:27 PM, Divya Gehlot 
> wrote:
>
>> Hi,
>> I am getting error when I am trying to connect hive table (which is being
>> created through HbaseIntegration) in spark
>>
>> Steps I followed :
>> *Hive Table creation code  *:
>> CREATE EXTERNAL TABLE IF NOT EXISTS TEST(NAME STRING,AGE INT)
>> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,0:AGE")
>> TBLPROPERTIES ("hbase.table.name" = "TEST",
>> "hbase.mapred.output.outputtable" = "TEST");
>>
>>
>> *DESCRIBE TEST ;*
>> col_namedata_typecomment
>> namestring from deserializer
>> age   int from deserializer
>>
>>
>> *Spark Code :*
>> import org.apache.spark._
>> import org.apache.spark.sql._
>>
>> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> hiveContext.sql("from TEST SELECT  NAME").collect.foreach(println)
>>
>>
>> *Starting Spark shell*
>> spark-shell --jars
>> /usr/hdp/2.3.4.0-3485/hive/lib/guava-14.0.1.jar,/usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-client.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-common.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-protocol.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-server.jar,/usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler.jar,/usr/hdp/2.3.4.0-3485/hive/lib/htrace-core-3.1.0-incubating.jar,/usr/hdp/2.3.4.0-3485/hive/lib/zookeeper-3.4.6.2.3.4.0-3485.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-server.jar
>> --driver-class-path
>> /usr/hdp/2.3.4.0-3485/hive/lib/guava-14.0.1.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-client.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-common.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-protocol.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-server.jar,/usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler.jar,/usr/hdp/2.3.4.0-3485/hive/lib/htrace-core-3.1.0-incubating.jar,/usr/hdp/2.3.4.0-3485/hive/lib/zookeeper-3.4.6.2.3.4.0-3485.jar,/usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-server.jar
>> --packages com.databricks:spark-csv_2.10:1.3.0  --master yarn-client -i
>> /TestDivya/Spark/InstrumentCopyToHDFSHive.scala
>>
>> *Stack Trace* :
>>
>> Stack SQL context available as sqlContext.
>>> Loading /TestDivya/Spark/InstrumentCopyToHDFSHive.scala...
>>> import org.apache.spark._
>>> import org.apache.spark.sql._
>>> 16/02/29 23:09:29 INFO HiveContext: Initializing execution hive, version
>>> 1.2.1
>>> 16/02/29 23:09:29 INFO ClientWrapper: Inspected Hadoop version:
>>> 2.7.1.2.3.4.0-3485
>>> 16/02/29 23:09:29 INFO ClientWrapper: Loaded
>>> org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version
>>> 2.7.1.2.3.4.0-3485
>>> 16/02/29 23:09:29 INFO HiveContext: default warehouse location is
>>> /user/hive/warehouse
>>> 16/02/29 23:09:29 INFO HiveContext: Initializing HiveMetastoreConnection
>>> version 1.2.1 using Spark classes.
>>> 16/02/29 23:09:29 INFO ClientWrapper: Inspected Hadoop version:
>>> 2.7.1.2.3.4.0-3485
>>> 16/02/29 23:09:29 INFO ClientWrapper: Loaded
>>> org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version
>>> 2.7.1.2.3.4.0-3485
>>> 16/02/29 23:09:30 WARN NativeCodeLoader: Unable to load native-hadoop
>>> library for your platform... using builtin-java classes where applicable
>>> 16/02/29 23:09:30 INFO metastore: Trying to connect to metastore with
>>> URI thrift://ip-xxx-xx-xx-xxx.ap-southeast-1.compute.internal:9083
>>> 16/02/29 23:09:30 INFO metastore: Connected to metastore.
>>> 16/02/29 23:09:30 WARN DomainSocketFactory: The short-circuit local
>>> reads feature cannot be used because libhadoop cannot be loaded.
>>> 16/02/29 23:09:31 INFO SessionState: Created local directory:
>>> /tmp/1bf53785-f7c8-406d-a733-a5858ccb2d16_resources
>>> 16/02/29 23:09:31 INFO SessionState: Created HDFS directory:
>>> /tmp/hive/hdfs/1bf53785-f7c8-406d-a733-a5858ccb2d16
>>> 16/02/29 23:09:31 INFO SessionState: Created local directory:
>>> /tmp/hdfs/1bf53785-f7c8-406d-a733-a5858ccb2d16
>>> 16/02/29 23:09:31 INFO SessionState: Created HDFS directory:
>>> /tmp/hive/hdfs/1bf53785-f7c8-406d-a733-a5858ccb2d16/_tmp_space.db
>>> hiveContext: org.apache.spark.sql.hive.HiveContext =
>>> org.apache.spark.sql.hive.HiveContext@10

Update edge weight in graphx

2016-02-29 Thread naveen.marri
Hi,
   
 I'm trying to implement an algorithm using GraphX which involves
updating edge weights during every iteration. The format is
[Node]-[Node]--[Weight]
  Ex: 
  I checked the GraphX docs but didn't find any resources on changing
the weight of an edge within the same RDD.
 I know RDDs are immutable; is there any way to do this in GraphX?
 Also, is there any way to dynamically add vertices and edges to the
graph within the same RDD?

 Regards,
 Naveen
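
Not from this thread, but for illustration, a hedged sketch of the usual
immutable-update pattern in GraphX (assuming the edge attribute is a Double
holding the weight): each iteration derives a new Graph instead of mutating
the old one.

import org.apache.spark.graphx.Graph

// Hedged sketch: "update" weights by building a new graph each iteration.
// The halving rule is only a placeholder for the algorithm's own update rule.
def reweight(g: Graph[Long, Double]): Graph[Long, Double] =
  g.mapEdges(e => e.attr * 0.5)

// Adding vertices/edges likewise means constructing a new Graph from unioned
// vertex and edge RDDs, e.g. Graph(oldVerts.union(newVerts), oldEdges.union(newEdges)).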



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Update-edge-weight-in-graphx-tp26367.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Support virtualenv in PySpark

2016-02-29 Thread Jeff Zhang
I have created a JIRA for this feature; comments and feedback are welcome
on how to improve it and whether it's valuable for users.

https://issues.apache.org/jira/browse/SPARK-13587


Here's some background info and status of this work.


Currently, it's not easy for users to add third-party Python packages in
pyspark.

   - One way is to use --py-files (suitable for simple dependencies, but
   not suitable for complicated dependencies, especially those with transitive
   dependencies)
   - Another way is to install packages manually on each node (time-consuming,
   and not easy to switch to a different environment)

Python now has 2 different virtualenv implementations. One is the native
virtualenv; the other is through conda.

I have implemented a POC for this feature. Here's one simple command showing how
to use virtualenv in pyspark:

bin/spark-submit --master yarn --deploy-mode client --conf
"spark.pyspark.virtualenv.enabled=true" --conf
"spark.pyspark.virtualenv.type=conda" --conf
"spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt"
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"
 ~/work/virtualenv/spark.py

There are 4 properties that need to be set:

   - spark.pyspark.virtualenv.enabled (enable virtualenv)
   - spark.pyspark.virtualenv.type (native/conda are supported, default is
   native)
   - spark.pyspark.virtualenv.requirements (requirement file for the
   dependencies)
   - spark.pyspark.virtualenv.path (path to the executable for
   virtualenv/conda)






Best Regards

Jeff Zhang


RE: Use maxmind geoip lib to process ip on Spark/Spark Streaming

2016-02-29 Thread Silvio Fiorito

I’ve used the code below with Spark SQL. I was using this with Spark 1.4, but it
should still be good with 1.6. In this case I have a UDF to do the lookup, but
for Streaming you’d just have a lambda to apply in a map function, so no UDF
wrapper.

import org.apache.spark.sql.functions._
import java.io.File
import java.net.InetAddress
import com.maxmind.geoip2._

object GeoIPLookup {
  @transient lazy val reader = {
    val db = new File("/data/meetup/GeoLite2-City.mmdb")
    val reader = new DatabaseReader.Builder(db).build()
    reader
  }
}

case class Location(latitude: Double, longitude: Double)
case class Geo(city: String, country: String, loc: Location)

val iplookup = udf { (s: String) => {
  val ip = InetAddress.getByName(s)
  val record = GeoIPLookup.reader.city(ip)

  val city = record.getCity
  val country = record.getCountry
  val location = record.getLocation

  Geo(city.getName, country.getName,
    Location(location.getLatitude, location.getLongitude))
} }

val withGeo = df.withColumn("geo", iplookup(column("ip")))
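
For the Streaming case mentioned above, a hedged sketch of the same lookup applied as a
plain map function (ipStream is a hypothetical DStream[String] of IP addresses; it reuses
the GeoIPLookup object and case classes defined above):

val geoStream = ipStream.map { s =>
  // Same lookup logic as the UDF, applied per record inside the DStream.
  val ip = InetAddress.getByName(s)
  val record = GeoIPLookup.reader.city(ip)
  Geo(record.getCity.getName, record.getCountry.getName,
    Location(record.getLocation.getLatitude, record.getLocation.getLongitude))
}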


From: Zhun Shen
Sent: Monday, February 29, 2016 11:17 PM
To: romain sagean
Cc: user
Subject: Re: Use maxmind geoip lib to process ip on Spark/Spark Streaming

Hi,

I checked the dependencies and fixed the bug. It works well on Spark but not on
Spark Streaming, so I think I still need to find another way to do it.


On Feb 26, 2016, at 2:47 PM, Zhun Shen 
mailto:shenzhunal...@gmail.com>> wrote:

Hi,

thanks for you advice. I tried your method, I use Gradle to manage my scala 
code. 'com.snowplowanalytics:scala-maxmind-iplookups:0.2.0’ was imported in 
Gradle.

spark version: 1.6.0
scala: 2.10.4
scala-maxmind-iplookups: 0.2.0

I run my test, got the error as below:
java.lang.NoClassDefFoundError: scala/collection/JavaConversions$JMapWrapperLike
at com.snowplowanalytics.maxmind.iplookups.IpLookups$.apply(IpLookups.scala:53)




On Feb 24, 2016, at 1:10 AM, romain sagean 
mailto:romain.sag...@hupi.fr>> wrote:

I realize I forgot the sbt part

resolvers += "SnowPlow Repo" at 
"http://maven.snplow.com/releases/";

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.3.0",
  "com.snowplowanalytics"  %% "scala-maxmind-iplookups"  % "0.2.0"
)

otherwise, to process streaming log I use logstash with kafka as input. You can 
set kafka as output if you need to do some extra calculation with spark.

On 23/02/2016 15:07, Romain Sagean wrote:
Hi,
I use maxmind geoip with spark (no streaming). To make it work you should use 
mapPartition. I don't know if something similar exist for spark streaming.

my code for reference:

  def parseIP(ip:String, ipLookups: IpLookups): List[String] = {
val lookupResult = ipLookups.performLookups(ip)
val countryName = (lookupResult._1).map(_.countryName).getOrElse("")
val city = (lookupResult._1).map(_.city).getOrElse(None).getOrElse("")
val latitude = (lookupResult._1).map(_.latitude).getOrElse(None).toString
val longitude = (lookupResult._1).map(_.longitude).getOrElse(None).toString
return List(countryName, city, latitude, longitude)
  }
sc.addFile("/home/your_user/GeoLiteCity.dat")

//load your data in my_data rdd

my_data.mapPartitions { rows =>
val ipLookups = IpLookups(geoFile = 
Some(SparkFiles.get("GeoLiteCity.dat")))
rows.map { row => row ::: parseIP(row(3),ipLookups) }
}

On 23/02/2016 14:28, Zhun Shen wrote:
Hi all,

Currently, I sent nginx log to Kafka then I want to use Spark Streaming to 
parse the log and enrich the IP info with geoip libs from Maxmind.

I found this one  
https://github.com/Sanoma-CDA/maxmind-geoip2-scala.git, but spark streaming 
throw error and told that the lib was not Serializable.

Does anyone there way to process the IP info in Spark Streaming? Many thanks.







Re: perl Kafka::Producer, “Kafka::Exception::Producer”, “code”, -1000, “message”, "Invalid argument

2016-02-29 Thread Vinti Maheshwari
Hi Cody,

Sorry, I realized afterwards that I should not have asked here. My actual program is
a Spark Streaming application, and I used Kafka for the input stream.

Thanks,
Vinti

On Mon, Feb 29, 2016 at 1:46 PM, Cody Koeninger  wrote:

> Does this issue involve Spark at all?  Otherwise you may have better luck
> on a perl or kafka related list.
>
> On Mon, Feb 29, 2016 at 3:26 PM, Vinti Maheshwari 
> wrote:
>
>> Hi All,
>>
>> I wrote kafka producer using kafka perl api, But i am getting error when
>> i am passing variable for sending message while if i am hard coding the
>> message data it's not giving any error.
>>
>> Perl program, where i added kafka producer code:
>>
>> try {
>> $kafka_connection = Kafka::Connection->new( host => 
>> $hadoop_server, port => '6667' );
>> $producer = Kafka::Producer->new( Connection => 
>> $kafka_connection );
>> my $topic = 'test1';
>> my $partition = 0;
>> my $message = $hadoop_str;
>> my $response = $producer->send(
>> $topic, # topic
>> $partition,  # partition
>> 
>> #"56b4b2b23c24c3608376d1ea,/obj/i386/ui/lib/access/daemon_map.So.gcda,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0"
>>  # message
>> $hadoop_str
>> #"t1,f9,1,1,1"
>> );
>> } catch {
>> my $error = $_;
>> if ( blessed( $error ) && $error->isa( 
>> 'Kafka::Exception' ) ) {
>> warn 'Error: (', $error->code, ') ',  
>> $error->message, "\n";
>> exit;
>> } else {
>> die $error;
>> }
>> };#CCLib::run_system_cmd( $cmd );
>> }
>>
>> Error Log: -bash-3.2$ ./stream_binary_hadoop.pl print (...) interpreted
>> as function at ./stream_binary_hadoop.pl line 429. Invalid argument:
>> message =
>> 56b4b2b23c24c3608376d1ea,/obj/i386/junos/usr.sbin/lmpd/lmpd_repl_msg_idr.gcda,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
>> at
>> /opt/adp/projects/code_coverage/perl//5.10/lib/site_perl/5.10.1/Exception/Class/Base.pm
>> line 85. Exception::Class::Base::throw("Kafka::Exception::Producer",
>> "code", -1000, "message", "Invalid argument: message =
>> 56b4b2b23c24c3608376d1ea,/obj/i38"...) called at
>> /opt/adp/projects/code_coverage/perl//5.10/lib/site_perl/5.10.1/Kafka/Producer.pm
>> line 374 Kafka::Producer::_error(Kafka::Producer=HASH(0x36955f8), -1000,
>> "message = 56b4b2b23c24c3608376d1ea,/obj/i386/junos/usr.sbin/l"...) called
>> at
>> /opt/adp/projects/code_coverage/perl//5.10/lib/site_perl/5.10.1/Kafka/Producer.pm
>> line 331 Kafka::Producer::send(Kafka::Producer=HASH(0x36955f8), "test1", 0,
>> "56b4b2b23c24c3608376d1ea,/obj/i386/junos/usr.sbin/lmpd/lmpd_r"...) called
>> at ./stream_binary_hadoop.pl line 175 main::try {...} () called at
>> /opt/adp/projects/code_coverage/perl//5.10/lib/site_perl/5.10.1/Try/Tiny.pm
>> line 81 eval {...} called at
>> /opt/adp/projects/code_coverage/perl//5.10/lib/site_perl/5.10.1/Try/Tiny.pm
>> line 72 Try::Tiny::try(CODE(0x3692888), Try::Tiny::Catch=REF(0x3692c78))
>> called at ./stream_binary_hadoop.pl line 190
>> main::stream(HASH(0x3692708)) called at ./stream_binary_hadoop.pl line
>> 354 main::file_split(HASH(0x36927b0)) called at ./stream_binary_hadoop.pl
>> line 413
>>
>> at ./stream_binary_hadoop.pl line 188. main::catch {...} (" Invalid
>> argument: message = 56b4b2b23c24c3608376d1e"...) called at
>> /opt/adp/projects/code_coverage/perl//5.10/lib/site_perl/5.10.1/Try/Tiny.pm
>> line 104 Try::Tiny::try(CODE(0x3692888), Try::Tiny::Catch=REF(0x3692c78))
>> called at ./stream_binary_hadoop.pl line 190
>> main::stream(HASH(0x3692708)) called at ./stream_binary_hadoop.pl line
>> 354 main::file_split(HASH(0x36927b0)) called at ./stream_binary_hadoop.pl
>> line 413
>>
>>
>>
>> Thank & Regards,
>>
>> ~Vinti
>>
>
>


Re: [ERROR]: Spark 1.5.2 + Hbase 1.1 + Hive 1.2 + HbaseIntegration

2016-02-29 Thread Ted Yu
16/02/29 23:09:34 INFO ClientCnxn: Opening socket connection to server
localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL
(unknown error)

Is your cluster a secure cluster?

bq. Trace :

Was there any output after 'Trace :' ?

Was hbase-site.xml accessible to your Spark job ?

Thanks

On Mon, Feb 29, 2016 at 8:27 PM, Divya Gehlot 
wrote:

> Hi,
> I am getting error when I am trying to connect hive table (which is being
> created through HbaseIntegration) in spark
>
> Steps I followed :
> *Hive Table creation code  *:
> CREATE EXTERNAL TABLE IF NOT EXISTS TEST(NAME STRING,AGE INT)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,0:AGE")
> TBLPROPERTIES ("hbase.table.name" = "TEST",
> "hbase.mapred.output.outputtable" = "TEST");
>
>
> *DESCRIBE TEST ;*
> col_namedata_typecomment
> namestring from deserializer
> age   int from deserializer
>
>
> *Spark Code :*
> import org.apache.spark._
> import org.apache.spark.sql._
>
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> hiveContext.sql("from TEST SELECT  NAME").collect.foreach(println)
>
>
> *Starting Spark shell*
> spark-shell --jars
> /usr/hdp/2.3.4.0-3485/hive/lib/guava-14.0.1.jar,/usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-client.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-common.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-protocol.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-server.jar,/usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler.jar,/usr/hdp/2.3.4.0-3485/hive/lib/htrace-core-3.1.0-incubating.jar,/usr/hdp/2.3.4.0-3485/hive/lib/zookeeper-3.4.6.2.3.4.0-3485.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-server.jar
> --driver-class-path
> /usr/hdp/2.3.4.0-3485/hive/lib/guava-14.0.1.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-client.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-common.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-protocol.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-server.jar,/usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler.jar,/usr/hdp/2.3.4.0-3485/hive/lib/htrace-core-3.1.0-incubating.jar,/usr/hdp/2.3.4.0-3485/hive/lib/zookeeper-3.4.6.2.3.4.0-3485.jar,/usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-server.jar
> --packages com.databricks:spark-csv_2.10:1.3.0  --master yarn-client -i
> /TestDivya/Spark/InstrumentCopyToHDFSHive.scala
>
> *Stack Trace* :
>
> Stack SQL context available as sqlContext.
>> Loading /TestDivya/Spark/InstrumentCopyToHDFSHive.scala...
>> import org.apache.spark._
>> import org.apache.spark.sql._
>> 16/02/29 23:09:29 INFO HiveContext: Initializing execution hive, version
>> 1.2.1
>> 16/02/29 23:09:29 INFO ClientWrapper: Inspected Hadoop version:
>> 2.7.1.2.3.4.0-3485
>> 16/02/29 23:09:29 INFO ClientWrapper: Loaded
>> org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version
>> 2.7.1.2.3.4.0-3485
>> 16/02/29 23:09:29 INFO HiveContext: default warehouse location is
>> /user/hive/warehouse
>> 16/02/29 23:09:29 INFO HiveContext: Initializing HiveMetastoreConnection
>> version 1.2.1 using Spark classes.
>> 16/02/29 23:09:29 INFO ClientWrapper: Inspected Hadoop version:
>> 2.7.1.2.3.4.0-3485
>> 16/02/29 23:09:29 INFO ClientWrapper: Loaded
>> org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version
>> 2.7.1.2.3.4.0-3485
>> 16/02/29 23:09:30 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> 16/02/29 23:09:30 INFO metastore: Trying to connect to metastore with URI
>> thrift://ip-xxx-xx-xx-xxx.ap-southeast-1.compute.internal:9083
>> 16/02/29 23:09:30 INFO metastore: Connected to metastore.
>> 16/02/29 23:09:30 WARN DomainSocketFactory: The short-circuit local reads
>> feature cannot be used because libhadoop cannot be loaded.
>> 16/02/29 23:09:31 INFO SessionState: Created local directory:
>> /tmp/1bf53785-f7c8-406d-a733-a5858ccb2d16_resources
>> 16/02/29 23:09:31 INFO SessionState: Created HDFS directory:
>> /tmp/hive/hdfs/1bf53785-f7c8-406d-a733-a5858ccb2d16
>> 16/02/29 23:09:31 INFO SessionState: Created local directory:
>> /tmp/hdfs/1bf53785-f7c8-406d-a733-a5858ccb2d16
>> 16/02/29 23:09:31 INFO SessionState: Created HDFS directory:
>> /tmp/hive/hdfs/1bf53785-f7c8-406d-a733-a5858ccb2d16/_tmp_space.db
>> hiveContext: org.apache.spark.sql.hive.HiveContext =
>> org.apache.spark.sql.hive.HiveContext@10b14f32
>> 16/02/29 23:09:32 INFO ParseDriver: Parsing command: from TEST SELECT
>>  NAME
>> 16/02/29 23:09:32 INFO ParseDriver: Parse Completed
>> 16/02/29 23:09:33 INFO deprecation: mapred.map.tasks is deprecated.
>> Instead, use mapreduce.job.maps
>> 16/02/29 23:09:33 INFO MemoryStore: ensureFreeSpace(468352) called with
>> curMem=0, maxMem=556038881
>> 16/02/29 23:09:33 INFO MemoryStore: Block broadcast_0 stored as values in
>> memory (estimated size 457.4 KB, free 529.8 MB)
>> 16/02/29 23:09:33 INFO Memory

[ERROR]: Spark 1.5.2 + Hbase 1.1 + Hive 1.2 + HbaseIntegration

2016-02-29 Thread Divya Gehlot
Hi,
I am getting an error when I try to connect to a Hive table (which was
created through HBaseIntegration) in Spark.

Steps I followed :
*Hive Table creation code  *:
CREATE EXTERNAL TABLE IF NOT EXISTS TEST(NAME STRING,AGE INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,0:AGE")
TBLPROPERTIES ("hbase.table.name" = "TEST",
"hbase.mapred.output.outputtable" = "TEST");


*DESCRIBE TEST ;*
col_name    data_type    comment
name        string       from deserializer
age         int          from deserializer


*Spark Code :*
import org.apache.spark._
import org.apache.spark.sql._

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("from TEST SELECT  NAME").collect.foreach(println)


*Starting Spark shell*
spark-shell --jars
/usr/hdp/2.3.4.0-3485/hive/lib/guava-14.0.1.jar,/usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-client.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-common.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-protocol.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-server.jar,/usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler.jar,/usr/hdp/2.3.4.0-3485/hive/lib/htrace-core-3.1.0-incubating.jar,/usr/hdp/2.3.4.0-3485/hive/lib/zookeeper-3.4.6.2.3.4.0-3485.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-server.jar
--driver-class-path
/usr/hdp/2.3.4.0-3485/hive/lib/guava-14.0.1.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-client.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-common.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-protocol.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-server.jar,/usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler.jar,/usr/hdp/2.3.4.0-3485/hive/lib/htrace-core-3.1.0-incubating.jar,/usr/hdp/2.3.4.0-3485/hive/lib/zookeeper-3.4.6.2.3.4.0-3485.jar,/usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-server.jar
--packages com.databricks:spark-csv_2.10:1.3.0  --master yarn-client -i
/TestDivya/Spark/InstrumentCopyToHDFSHive.scala

*Stack Trace* :

Stack SQL context available as sqlContext.
> Loading /TestDivya/Spark/InstrumentCopyToHDFSHive.scala...
> import org.apache.spark._
> import org.apache.spark.sql._
> 16/02/29 23:09:29 INFO HiveContext: Initializing execution hive, version
> 1.2.1
> 16/02/29 23:09:29 INFO ClientWrapper: Inspected Hadoop version:
> 2.7.1.2.3.4.0-3485
> 16/02/29 23:09:29 INFO ClientWrapper: Loaded
> org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version
> 2.7.1.2.3.4.0-3485
> 16/02/29 23:09:29 INFO HiveContext: default warehouse location is
> /user/hive/warehouse
> 16/02/29 23:09:29 INFO HiveContext: Initializing HiveMetastoreConnection
> version 1.2.1 using Spark classes.
> 16/02/29 23:09:29 INFO ClientWrapper: Inspected Hadoop version:
> 2.7.1.2.3.4.0-3485
> 16/02/29 23:09:29 INFO ClientWrapper: Loaded
> org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version
> 2.7.1.2.3.4.0-3485
> 16/02/29 23:09:30 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 16/02/29 23:09:30 INFO metastore: Trying to connect to metastore with URI
> thrift://ip-xxx-xx-xx-xxx.ap-southeast-1.compute.internal:9083
> 16/02/29 23:09:30 INFO metastore: Connected to metastore.
> 16/02/29 23:09:30 WARN DomainSocketFactory: The short-circuit local reads
> feature cannot be used because libhadoop cannot be loaded.
> 16/02/29 23:09:31 INFO SessionState: Created local directory:
> /tmp/1bf53785-f7c8-406d-a733-a5858ccb2d16_resources
> 16/02/29 23:09:31 INFO SessionState: Created HDFS directory:
> /tmp/hive/hdfs/1bf53785-f7c8-406d-a733-a5858ccb2d16
> 16/02/29 23:09:31 INFO SessionState: Created local directory:
> /tmp/hdfs/1bf53785-f7c8-406d-a733-a5858ccb2d16
> 16/02/29 23:09:31 INFO SessionState: Created HDFS directory:
> /tmp/hive/hdfs/1bf53785-f7c8-406d-a733-a5858ccb2d16/_tmp_space.db
> hiveContext: org.apache.spark.sql.hive.HiveContext =
> org.apache.spark.sql.hive.HiveContext@10b14f32
> 16/02/29 23:09:32 INFO ParseDriver: Parsing command: from TEST SELECT  NAME
> 16/02/29 23:09:32 INFO ParseDriver: Parse Completed
> 16/02/29 23:09:33 INFO deprecation: mapred.map.tasks is deprecated.
> Instead, use mapreduce.job.maps
> 16/02/29 23:09:33 INFO MemoryStore: ensureFreeSpace(468352) called with
> curMem=0, maxMem=556038881
> 16/02/29 23:09:33 INFO MemoryStore: Block broadcast_0 stored as values in
> memory (estimated size 457.4 KB, free 529.8 MB)
> 16/02/29 23:09:33 INFO MemoryStore: ensureFreeSpace(49454) called with
> curMem=468352, maxMem=556038881
> 16/02/29 23:09:33 INFO MemoryStore: Block broadcast_0_piece0 stored as
> bytes in memory (estimated size 48.3 KB, free 529.8 MB)
> 16/02/29 23:09:33 INFO BlockManagerInfo: Added broadcast_0_piece0 in
> memory on xxx.xx.xx.xxx:37784 (size: 48.3 KB, free: 530.2 MB)
> 16/02/29 23:09:33 INFO SparkContext: Created broadcast 0 from collect at
> :30
> 16/02/29 23:09:34 INFO HBaseStorageHandler: Configurin

Re: Use maxmind geoip lib to process ip on Spark/Spark Streaming

2016-02-29 Thread Zhun Shen
Hi,

I checked the dependencies and fixed the bug. It works well on Spark but not on
Spark Streaming, so I think I still need to find another way to do it.

 
> On Feb 26, 2016, at 2:47 PM, Zhun Shen  wrote:
> 
> Hi,
> 
> thanks for you advice. I tried your method, I use Gradle to manage my scala 
> code. 'com.snowplowanalytics:scala-maxmind-iplookups:0.2.0’ was imported in 
> Gradle.
> 
> spark version: 1.6.0
> scala: 2.10.4
> scala-maxmind-iplookups: 0.2.0
> 
> I run my test, got the error as below:
> java.lang.NoClassDefFoundError: 
> scala/collection/JavaConversions$JMapWrapperLike
>   at 
> com.snowplowanalytics.maxmind.iplookups.IpLookups$.apply(IpLookups.scala:53)
> 
> 
> 
> 
>> On Feb 24, 2016, at 1:10 AM, romain sagean > > wrote:
>> 
>> I realize I forgot the sbt part
>> 
>> resolvers += "SnowPlow Repo" at "http://maven.snplow.com/releases/"; 
>> 
>> 
>> libraryDependencies ++= Seq(
>>   "org.apache.spark" %% "spark-core" % "1.3.0",
>>   "com.snowplowanalytics"  %% "scala-maxmind-iplookups"  % "0.2.0"
>> )
>> 
>> otherwise, to process streaming log I use logstash with kafka as input. You 
>> can set kafka as output if you need to do some extra calculation with spark.
>> 
>> On 23/02/2016 15:07, Romain Sagean wrote:
>>> Hi,
>>> I use maxmind geoip with spark (no streaming). To make it work you should 
>>> use mapPartition. I don't know if something similar exist for spark 
>>> streaming.
>>> 
>>> my code for reference:
>>> 
>>>   def parseIP(ip:String, ipLookups: IpLookups): List[String] = {
>>> val lookupResult = ipLookups.performLookups(ip)
>>> val countryName = (lookupResult._1).map(_.countryName).getOrElse("")
>>> val city = (lookupResult._1).map(_.city).getOrElse(None).getOrElse("")
>>> val latitude = 
>>> (lookupResult._1).map(_.latitude).getOrElse(None).toString
>>> val longitude = 
>>> (lookupResult._1).map(_.longitude).getOrElse(None).toString
>>> return List(countryName, city, latitude, longitude)
>>>   }
>>> sc.addFile("/home/your_user/GeoLiteCity.dat")
>>> 
>>> //load your data in my_data rdd
>>> 
>>> my_data.mapPartitions { rows =>
>>> val ipLookups = IpLookups(geoFile = 
>>> Some(SparkFiles.get("GeoLiteCity.dat")))
>>> rows.map { row => row ::: parseIP(row(3),ipLookups) }
>>> }
>>> 
>>> On 23/02/2016 14:28, Zhun Shen wrote:
 Hi all,
 
 Currently, I sent nginx log to Kafka then I want to use Spark Streaming to 
 parse the log and enrich the IP info with geoip libs from Maxmind. 
 
 I found this one  
 https://github.com/Sanoma-CDA/maxmind-geoip2-scala.git
  , but spark 
 streaming throw error and told that the lib was not Serializable.
 
 Does anyone there way to process the IP info in Spark Streaming? Many 
 thanks.
 
>>> 
>> 
> 



SaveMode, parquet and S3

2016-02-29 Thread Peter Halliday
I have a system where I’m saving Parquet files to S3 via Spark. They are
partitioned a couple of ways: first by date and then by a partition key. There
are multiple Parquet files per combination over a long period of time. So the
structure is like this:

s3://bucketname/date=2016-02-29/partionkey=2342/filename.parquet.gz

There’s been disagreement on how the SaveMode should be used when saving out
the data. If we keep the SaveMode as ErrorIfExists, does that mean additional
partitions or Parquet files that are written out later with the same parts of
the subpath won’t be able to be written successfully? Also, does the SaveMode
apply to tasks too? Say we are using the Direct Output Committer, and there’s
a failure in a task that causes some files to be written and others in the task
not to be written. Would it automatically inherit the SaveMode in the
individual file’s case, or does the SaveMode apply only to the files as a whole?
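
For reference, a hedged sketch (Scala) of the kind of partitioned write being discussed;
the column names are assumptions read off the S3 path layout above, df stands in for the
DataFrame being saved, and the mode choice is exactly what is in question:

import org.apache.spark.sql.SaveMode

df.write
  .mode(SaveMode.ErrorIfExists)           // or SaveMode.Append, per the disagreement above
  .partitionBy("date", "partitionkey")    // assumed column names, from the path layout above
  .parquet("s3://bucketname/")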

Peter Halliday



RE: Spark UI standalone "crashes" after an application finishes

2016-02-29 Thread Mohammed Guller
I believe the OP is referring to the application UI on port 4040.

The application UI on port 4040 is available only while application is running. 
As per the documentation:
To view the web UI after the fact, set spark.eventLog.enabled to true before 
starting the application. This configures Spark to log Spark events that encode 
the information displayed in the UI to persisted storage.
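
For example, a minimal sketch (Scala) of turning the event log on before starting the
application; the app name and log directory are assumptions — any storage the UI can
read back from should do:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")                               // hypothetical app name
  .set("spark.eventLog.enabled", "true")             // persist UI events, per the docs quoted above
  .set("spark.eventLog.dir", "hdfs:///spark-events") // assumed log location
val sc = new SparkContext(conf)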

Mohammed
Author: Big Data Analytics with 
Spark

From: Shixiong(Ryan) Zhu [mailto:shixi...@databricks.com]
Sent: Monday, February 29, 2016 4:03 PM
To: Sumona Routh
Cc: user@spark.apache.org
Subject: Re: Spark UI standalone "crashes" after an application finishes

Do you mean you cannot access Master UI after your application completes? Could 
you check the master log?

On Mon, Feb 29, 2016 at 3:48 PM, Sumona Routh 
mailto:sumos...@gmail.com>> wrote:
Hi there,
I've been doing some performance tuning of our Spark application, which is 
using Spark 1.2.1 standalone. I have been using the spark metrics to graph out 
details as I run the jobs, as well as the UI to review the tasks and stages.
I notice that after my application completes, or is near completion, the UI 
"crashes." I get a Connection Refused response. Sometimes, the page eventually 
recovers and will load again, but sometimes I end up having to restart the 
Spark master to get it back. When I look at my graphs on the app, the memory 
consumption (of driver, executors, and what I believe to be the daemon 
(spark.jvm.total.used)) appears to be healthy. Monitoring the master machine 
itself, memory and CPU appear healthy as well.
Has anyone else seen this issue? Are there logs for the UI itself, and where 
might I find those?
Thanks!
Sumona



Re: Spark UI standalone "crashes" after an application finishes

2016-02-29 Thread Sea
Hi, Sumona:
  It's a bug in older Spark versions; in Spark 1.6.0 it is fixed.
  After an application completes, the Spark master loads the event log into
memory, and it does so synchronously because of the actor model. If the event log is
big, the Spark master will hang for a long time and you cannot submit any applications;
if your master's memory is too small, your master will die!
  The solution in Spark 1.6 is not very good either: the operation is async, so
you still need to set a big Java heap for the master.






------------------ Original message ------------------
From: "Shixiong(Ryan) Zhu";
Sent: Tuesday, March 1, 2016, 8:02
To: "Sumona Routh";
Cc: "user@spark.apache.org";
Subject: Re: Spark UI standalone "crashes" after an application finishes



Do you mean you cannot access Master UI after your application completes? Could 
you check the master log?

On Mon, Feb 29, 2016 at 3:48 PM, Sumona Routh  wrote:
Hi there,

I've been doing some performance tuning of our Spark application, which is 
using Spark 1.2.1 standalone. I have been using the spark metrics to graph out 
details as I run the jobs, as well as the UI to review the tasks and stages.


I notice that after my application completes, or is near completion, the UI 
"crashes." I get a Connection Refused response. Sometimes, the page eventually 
recovers and will load again, but sometimes I end up having to restart the 
Spark master to get it back. When I look at my graphs on the app, the memory 
consumption (of driver, executors, and what I believe to be the daemon 
(spark.jvm.total.used)) appears to be healthy. Monitoring the master machine 
itself, memory and CPU appear healthy as well.


Has anyone else seen this issue? Are there logs for the UI itself, and where 
might I find those?


Thanks!

Sumona

Fwd: Mapper side join with DataFrames API

2016-02-29 Thread Deepak Gopalakrishnan
Hello All,

I'm trying to join two DataFrames, A and B, with

sqlContext.sql("SELECT * FROM A INNER JOIN B ON A.a=B.a");

What I have done is register temp tables for A and B after
loading these DataFrames from different sources. I need the join to be
really fast, and I was wondering if there is a way to use the SQL statement
while still doing a mapper-side join (say my table B is small)?

I read some articles on using broadcast to do mapper-side joins. Could I do
something like this and then execute my SQL statement to achieve a mapper-side
join?

DataFrame B = sparkContext.broadcast(B);
B.registerTempTable("B");
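
Not from this thread, but for illustration, a hedged sketch (Scala) of the broadcast-hint
approach often used for this: org.apache.spark.sql.functions.broadcast marks the smaller
DataFrame so the planner can pick a broadcast (mapper-side) join. dfA and dfB are
hypothetical DataFrames standing in for tables A and B; whether this fits your exact
setup is an assumption.

import org.apache.spark.sql.functions.broadcast

// Hint that dfB is small enough to be shipped to every executor.
val joined = dfA.join(broadcast(dfB), dfA("a") === dfB("a"))

With the plain SQL statement, the planner can also choose a broadcast join on its own
when it can estimate that the smaller table is below spark.sql.autoBroadcastJoinThreshold.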


I have a join as stated above and I see in my executor logs the below :

16/02/29 17:02:35 INFO TaskSetManager: Finished task 198.0 in stage 7.0
(TID 1114) in 20354 ms on localhost (196/200)

16/02/29 17:02:35 INFO ShuffleBlockFetcherIterator: Getting 200 non-empty
blocks out of 200 blocks

16/02/29 17:02:35 INFO ShuffleBlockFetcherIterator: Started 0 remote
fetches in 0 ms

16/02/29 17:02:35 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty
blocks out of 128 blocks

16/02/29 17:02:35 INFO ShuffleBlockFetcherIterator: Started 0 remote
fetches in 0 ms

16/02/29 17:03:03 INFO Executor: Finished task 199.0 in stage 7.0 (TID
1115). 2511 bytes result sent to driver

16/02/29 17:03:03 INFO TaskSetManager: Finished task 199.0 in stage 7.0
(TID 1115) in 27621 ms on localhost (197/200)

*16/02/29 17:07:06 INFO UnsafeExternalSorter: Thread 124 spilling sort data
of 256.0 KB to disk (0  time so far)*


Now, I have around 10G of executor memory and my memory fraction should be
the default (0.75 as per the documentation), and my memory usage is < 1.5G
(obtained from the Storage tab on the Spark dashboard), but it still says it is
spilling sort data. I'm a little surprised that this happens even when I
have enough memory free.

Any inputs will be greatly appreciated!

Thanks
-- 
Regards,
*Deepak Gopalakrishnan*
*Mobile*:+918891509774
*Skype* : deepakgk87
http://myexps.blogspot.com


Error when trying to insert data to a Parquet data source in HiveQL

2016-02-29 Thread SRK
Hi,

I seem to be getting the following error when I try to insert data to a
parquet datasource. Any idea as to why this is happening?


org.apache.hadoop.hive.ql.metadata.HiveException:
parquet.hadoop.MemoryManager$1: New Memory allocation 1045004 bytes is
smaller than the minimum allocation size of 1048576 bytes.
at
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
at
org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.org$apache$spark$sql$hive$SparkHiveDynamicPartitionWriterContainer$$newWriter$1(hiveWriterContainers.scala:240)
at
org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:249)
at
org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:249)
at
scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189)
at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
at
org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.getLocalFileWriter(hiveWriterContainers.scala:249)
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:112)
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:104)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-when-trying-to-insert-data-to-a-Parquet-data-source-in-HiveQL-tp26365.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-29 Thread Shixiong(Ryan) Zhu
Could you post the screenshot of the Streaming DAG and also the driver log?
It would be great if you have a simple producer for us to debug.

On Mon, Feb 29, 2016 at 1:39 AM, Abhishek Anand 
wrote:

> Hi Ryan,
>
> Its not working even after removing the reduceByKey.
>
> So, basically I am doing the following
> - reading from kafka
> - flatmap inside transform
> - mapWithState
> - rdd.count on output of mapWithState
>
> But to my surprise still dont see checkpointing taking place.
>
> Is there any restriction to the type of operation that we can perform
> inside mapWithState ?
>
> Really need to resolve this one as currently if my application is
> restarted from checkpoint it has to repartition 120 previous stages which
> takes hell lot of time.
>
> Thanks !!
> Abhi
>
> On Mon, Feb 29, 2016 at 3:42 AM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Sorry that I forgot to tell you that you should also call `rdd.count()`
>> for "reduceByKey" as well. Could you try it and see if it works?
>>
>> On Sat, Feb 27, 2016 at 1:17 PM, Abhishek Anand 
>> wrote:
>>
>>> Hi Ryan,
>>>
>>> I am using mapWithState after doing reduceByKey.
>>>
>>> I am right now using mapWithState as you suggested and triggering the
>>> count manually.
>>>
>>> But, still unable to see any checkpointing taking place. In the DAG I
>>> can see that the reduceByKey operation for the previous batches are also
>>> being computed.
>>>
>>>
>>> Thanks
>>> Abhi
>>>
>>>
>>> On Tue, Feb 23, 2016 at 2:36 AM, Shixiong(Ryan) Zhu <
>>> shixi...@databricks.com> wrote:
>>>
 Hey Abhi,

 Using reducebykeyandwindow and mapWithState will trigger the bug
 in SPARK-6847. Here is a workaround to trigger checkpoint manually:

 JavaMapWithStateDStream<...> stateDStream =
 myPairDstream.mapWithState(StateSpec.function(mappingFunc));
 stateDStream.foreachRDD(new Function1<...>() {
   @Override
   public Void call(JavaRDD<...> rdd) throws Exception {
 rdd.count();
   }
 });
 return stateDStream.stateSnapshots();


 On Mon, Feb 22, 2016 at 12:25 PM, Abhishek Anand <
 abhis.anan...@gmail.com> wrote:

> Hi Ryan,
>
> Reposting the code.
>
> Basically my use case is something like - I am receiving the web
> impression logs and may get the notify (listening from kafka) for those
> impressions in the same interval (for me its 1 min) or any next interval
> (upto 2 hours). Now, when I receive notify for a particular impression I
> need to swap the date field in impression with the date field in notify
> logs. The notify for an impression has the same key as impression.
>
> static Function3, State,
> Tuple2> mappingFunc =
> new Function3, State,
> Tuple2>() {
> @Override
> public Tuple2 call(String key, Optional one,
> State state) {
> MyClass nullObj = new MyClass();
> nullObj.setImprLog(null);
> nullObj.setNotifyLog(null);
> MyClass current = one.or(nullObj);
>
> if(current!= null && current.getImprLog() != null &&
> current.getMyClassType() == 1 /*this is impression*/){
> return new Tuple2<>(key, null);
> }
> else if (current.getNotifyLog() != null  && current.getMyClassType()
> == 3 /*notify for the impression received*/){
> MyClass oldState = (state.exists() ? state.get() : nullObj);
> if(oldState!= null && oldState.getNotifyLog() != null){
> oldState.getImprLog().setCrtd(current.getNotifyLog().getCrtd());
>  //swappping the dates
> return new Tuple2<>(key, oldState);
> }
> else{
> return new Tuple2<>(key, null);
> }
> }
> else{
> return new Tuple2<>(key, null);
> }
>
> }
> };
>
>
> return
> myPairDstream.mapWithState(StateSpec.function(mappingFunc)).stateSnapshots();
>
>
> Currently I am using reducebykeyandwindow without the inverse function
> and I am able to get the correct data. But, issue the might arise is when 
> I
> have to restart my application from checkpoint and it repartitions and
> computes the previous 120 partitions, which delays the incoming batches.
>
>
> Thanks !!
> Abhi
>
> On Tue, Feb 23, 2016 at 1:25 AM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Hey Abhi,
>>
>> Could you post how you use mapWithState? By default, it should do
>> checkpointing every 10 batches.
>> However, there is a known issue that prevents mapWithState from
>> checkpointing in some special cases:
>> https://issues.apache.org/jira/browse/SPARK-6847
>>
>> On Mon, Feb 22, 2016 at 5:47 AM, Abhishek Anand <
>> abhis.anan...@gmail.com> wrote:
>>
>>> Any Insights on this one ?
>>>
>>>
>>> Thanks !!!
>>> Abhi
>>>
>>> On Mon, Feb 15, 2016 at 11:08 PM, Abhishek Anand <
>>> abhis.anan...@gmail.com> wrote:
>>>
 I am now

DataSet Evidence

2016-02-29 Thread Steve Lewis
 I have a relatively complex Java object that I would like to use in a
Dataset.

If I say

Encoder<MyType> evidence = Encoders.kryo(MyType.class);

JavaRDD<MyType> rddMyType = generateRDD(); // some code

Dataset<MyType> datasetMyType = sqlCtx.createDataset(rddMyType.rdd(),
evidence);


I get one column - the whole object

The object is a bean with all fields having getters and setters, but
some of the fields are other complex Java objects.

It would be fine to serialize the objects in these fields with Kryo or
Java serialization, but the bean serializer treats all referenced
objects as beans, and some lack the required getters and setters.

How can I get my columns with the bean serializer even if some of the
values in the columns are not bean types?
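
For contrast, a hedged sketch (Scala, Spark 1.6-era API) of the bean-encoder variant
referred to above; whether it can handle MyType's nested non-bean fields is exactly the
open question in this message:

import org.apache.spark.sql.{Dataset, Encoder, Encoders}

// Hypothetical: MyType, sqlCtx and rddMyType are the objects from the message above.
val beanEvidence: Encoder[MyType] = Encoders.bean(classOf[MyType])
val datasetMyType: Dataset[MyType] = sqlCtx.createDataset(rddMyType.rdd)(beanEvidence)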


Re: Spark UI standalone "crashes" after an application finishes

2016-02-29 Thread Shixiong(Ryan) Zhu
Do you mean you cannot access Master UI after your application completes?
Could you check the master log?

On Mon, Feb 29, 2016 at 3:48 PM, Sumona Routh  wrote:

> Hi there,
> I've been doing some performance tuning of our Spark application, which is
> using Spark 1.2.1 standalone. I have been using the spark metrics to graph
> out details as I run the jobs, as well as the UI to review the tasks and
> stages.
>
> I notice that after my application completes, or is near completion, the
> UI "crashes." I get a Connection Refused response. Sometimes, the page
> eventually recovers and will load again, but sometimes I end up having to
> restart the Spark master to get it back. When I look at my graphs on the
> app, the memory consumption (of driver, executors, and what I believe to be
> the daemon (spark.jvm.total.used)) appears to be healthy. Monitoring the
> master machine itself, memory and CPU appear healthy as well.
>
> Has anyone else seen this issue? Are there logs for the UI itself, and
> where might I find those?
>
> Thanks!
> Sumona
>


Spark UI standalone "crashes" after an application finishes

2016-02-29 Thread Sumona Routh
Hi there,
I've been doing some performance tuning of our Spark application, which is
using Spark 1.2.1 standalone. I have been using the spark metrics to graph
out details as I run the jobs, as well as the UI to review the tasks and
stages.

I notice that after my application completes, or is near completion, the UI
"crashes." I get a Connection Refused response. Sometimes, the page
eventually recovers and will load again, but sometimes I end up having to
restart the Spark master to get it back. When I look at my graphs on the
app, the memory consumption (of driver, executors, and what I believe to be
the daemon (spark.jvm.total.used)) appears to be healthy. Monitoring the
master machine itself, memory and CPU appear healthy as well.

Has anyone else seen this issue? Are there logs for the UI itself, and
where might I find those?

Thanks!
Sumona


Re: Spark for client

2016-02-29 Thread Mich Talebzadeh
Thank you very much, both.

Zeppelin looks promising. Basically, as I understand it, it runs an agent on a
given port (I chose 21999) on the host where Spark is installed. I created a
notebook and am running scripts through there. One thing for sure: the notebook
just returns the results rather than all the other stuff that one does not need.

Cheers,



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 29 February 2016 at 19:22, Minudika Malshan 
wrote:

> +Adding resources
> https://zeppelin.incubator.apache.org/docs/latest/interpreter/spark.html
> https://zeppelin.incubator.apache.org
>
> Minudika Malshan
> Undergraduate
> Department of Computer Science and Engineering
> University of Moratuwa.
> *Mobile : +94715659887 <%2B94715659887>*
> *LinkedIn* : https://lk.linkedin.com/in/minudika
>
>
>
> On Tue, Mar 1, 2016 at 12:51 AM, Minudika Malshan 
> wrote:
>
>> Hi,
>>
>> I think zeppelin spark interpreter will give a solution to your problem.
>>
>> Regards.
>> Minudika
>>
>> Minudika Malshan
>> Undergraduate
>> Department of Computer Science and Engineering
>> University of Moratuwa.
>> *Mobile : +94715659887*
>> *LinkedIn* : https://lk.linkedin.com/in/minudika
>>
>>
>>
>> On Tue, Mar 1, 2016 at 12:35 AM, Sabarish Sasidharan <
>> sabarish.sasidha...@manthan.com> wrote:
>>
>>> Zeppelin?
>>>
>>> Regards
>>> Sab
>>> On 01-Mar-2016 12:27 am, "Mich Talebzadeh" 
>>> wrote:
>>>
 Hi,

 Is there such thing as Spark for client much like RDBMS client that
 have cut down version of their big brother useful for client connectivity
 but cannot be used as server.

 Thanks


 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



>>>
>>
>


RE: a basic question on first use of PySpark shell and example, which is failing

2016-02-29 Thread Taylor, Ronald C
I guess I should also point out that I do an

export CLASSPATH

in my  .bash_profile file, so the CLASSPATH info should be usable by the 
PySpark shell that I invoke.

Ron

Ronald C. Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
Richland, WA 99352
phone: (509) 372-6568,  email: ronald.tay...@pnnl.gov
web page:  http://www.pnnl.gov/science/staff/staff_info.asp?staff_num=7048

From: Taylor, Ronald C
Sent: Monday, February 29, 2016 2:57 PM
To: 'Yin Yang'; user@spark.apache.org
Cc: Jules Damji; ronald.taylo...@gmail.com; Taylor, Ronald C
Subject: RE: a basic question on first use of PySpark shell and example, which 
is failing

Hi Yin,

My Classpath is set to:

CLASSPATH=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:/people/rtaylor/SparkWork/DataAlgUtils:.

And there is indeed a spark-core.jar in the ../jars subdirectory, though it is 
not named precisely “spark-core.jar”. It has a version number in its name, as 
you can see:

[rtaylor@bigdatann ~]$ find 
/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars -name "spark-core*.jar"

/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-core_2.10-1.5.0-cdh5.5.1.jar

I extracted the class names into a text file:

[rtaylor@bigdatann jars]$ jar tf spark-core_2.10-1.5.0-cdh5.5.1.jar > 
/people/rtaylor/SparkWork/jar_file_listing_of_spark-core_jar.txt

And then searched for RDDOperationScope. I found these classes:

[rtaylor@bigdatann SparkWork]$ grep RDDOperationScope 
jar_file_listing_of_spark-core_jar.txt

org/apache/spark/rdd/RDDOperationScope$$anonfun$5.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$3.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$4$$anonfun$apply$1.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$4.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$1.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$getAllScopes$2.class
org/apache/spark/rdd/RDDOperationScope$.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$getAllScopes$1.class
org/apache/spark/rdd/RDDOperationScope.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$2.class
[rtaylor@bigdatann SparkWork]$


It looks like the RDDOperationScope class is present. Shouldn’t that work?

Ron

Ronald C. Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
Richland, WA 99352
phone: (509) 372-6568,  email: 
ronald.tay...@pnnl.gov
web page:  http://www.pnnl.gov/science/staff/staff_info.asp?staff_num=7048

From: Yin Yang [mailto:yy201...@gmail.com]
Sent: Monday, February 29, 2016 2:27 PM
To: Taylor, Ronald C
Cc: Jules Damji; user@spark.apache.org; 
ronald.taylo...@gmail.com
Subject: Re: a basic question on first use of PySpark shell and example, which 
is failing

RDDOperationScope is in spark-core_2.1x jar file.

  7148 Mon Feb 29 09:21:32 PST 2016 org/apache/spark/rdd/RDDOperationScope.class

Can you check whether the spark-core jar is in classpath ?

FYI

On Mon, Feb 29, 2016 at 1:40 PM, Taylor, Ronald C <ronald.tay...@pnnl.gov> wrote:
Hi Jules, folks,

I have tried “hdfs://<full filepath>” as well as “file://<full filepath>”.  And several variants. 
Every time, I get the same msg – NoClassDefFoundError. See below. Why do I get 
such a msg, if the problem is simply that Spark cannot find the text file? 
Doesn’t the error msg indicate some other source of the problem?

I may be missing something in the error report; I am a Java person, not a 
Python programmer.  But doesn’t it look like a call to a Java class –something 
associated with “o9.textFile” -  is failing?  If so, how do I fix this?

  Ron


"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/context.py",
 line 451, in textFile
return RDD(self._jsc.textFile(name, minPartitions), self,
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 538, in __call__
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/sql/utils.py",
 line 36, in deco
return f(*a, **kw)
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
 line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o9.textFile.
: java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.spark.rdd.RDDOperationScope$

Ronald C. Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
Richland, WA 99352
phone: (509) 372-6568,  email: 
ronald.tay...@pnnl.gov
web page:  http://www.pnnl.gov/science/staff/staff_info.asp?staff_num=7048

From: Jules Damji [mailto:dmat...@comcast.net]
Sent: Sunday, February 28, 2016 10:07 PM
To: Taylor, Ronald C
Cc: user@spa

RE: a basic question on first use of PySpark shell and example, which is failing

2016-02-29 Thread Taylor, Ronald C
Hi Yin,

My Classpath is set to:

CLASSPATH=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:/people/rtaylor/SparkWork/DataAlgUtils:.

And there is indeed a spark-core.jar in the ../jars subdirectory, though it is 
not named precisely “spark-core.jar”. It has a version number in its name, as 
you can see:

[rtaylor@bigdatann ~]$ find 
/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars -name "spark-core*.jar"

/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-core_2.10-1.5.0-cdh5.5.1.jar

I extracted the class names into a text file:

[rtaylor@bigdatann jars]$ jar tf spark-core_2.10-1.5.0-cdh5.5.1.jar > 
/people/rtaylor/SparkWork/jar_file_listing_of_spark-core_jar.txt

And then searched for RDDOperationScope. I found these classes:

[rtaylor@bigdatann SparkWork]$ grep RDDOperationScope 
jar_file_listing_of_spark-core_jar.txt

org/apache/spark/rdd/RDDOperationScope$$anonfun$5.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$3.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$4$$anonfun$apply$1.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$4.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$1.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$getAllScopes$2.class
org/apache/spark/rdd/RDDOperationScope$.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$getAllScopes$1.class
org/apache/spark/rdd/RDDOperationScope.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$2.class
[rtaylor@bigdatann SparkWork]$


It looks like the RDDOperationScope class is present. Shouldn’t that work?

Ron

Ronald C. Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
Richland, WA 99352
phone: (509) 372-6568,  email: ronald.tay...@pnnl.gov
web page:  http://www.pnnl.gov/science/staff/staff_info.asp?staff_num=7048

From: Yin Yang [mailto:yy201...@gmail.com]
Sent: Monday, February 29, 2016 2:27 PM
To: Taylor, Ronald C
Cc: Jules Damji; user@spark.apache.org; ronald.taylo...@gmail.com
Subject: Re: a basic question on first use of PySpark shell and example, which 
is failing

RDDOperationScope is in spark-core_2.1x jar file.

  7148 Mon Feb 29 09:21:32 PST 2016 org/apache/spark/rdd/RDDOperationScope.class

Can you check whether the spark-core jar is in classpath ?

FYI

On Mon, Feb 29, 2016 at 1:40 PM, Taylor, Ronald C <ronald.tay...@pnnl.gov> wrote:
Hi Jules, folks,

I have tried “hdfs://<full filepath>” as well as “file://<full filepath>”.  And several variants. 
Every time, I get the same msg – NoClassDefFoundError. See below. Why do I get 
such a msg, if the problem is simply that Spark cannot find the text file? 
Doesn’t the error msg indicate some other source of the problem?

I may be missing something in the error report; I am a Java person, not a 
Python programmer.  But doesn’t it look like a call to a Java class –something 
associated with “o9.textFile” -  is failing?  If so, how do I fix this?

  Ron


"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/context.py",
 line 451, in textFile
return RDD(self._jsc.textFile(name, minPartitions), self,
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 538, in __call__
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/sql/utils.py",
 line 36, in deco
return f(*a, **kw)
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
 line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o9.textFile.
: java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.spark.rdd.RDDOperationScope$

Ronald C. Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
Richland, WA 99352
phone: (509) 372-6568,  email: 
ronald.tay...@pnnl.gov
web page:  http://www.pnnl.gov/science/staff/staff_info.asp?staff_num=7048

From: Jules Damji [mailto:dmat...@comcast.net]
Sent: Sunday, February 28, 2016 10:07 PM
To: Taylor, Ronald C
Cc: user@spark.apache.org; 
ronald.taylo...@gmail.com
Subject: Re: a basic question on first use of PySpark shell and example, which 
is failing


Hello Ronald,

Since you have placed the file under HDFS, you might want to change the path name
to:

val lines = sc.textFile("hdfs://user/taylor/Spark/Warehouse.java")

Sent from my iPhone
Pardon the dumb thumb typos :)

On Feb 28, 2016, at 9:36 PM, Taylor, Ronald C <ronald.tay...@pnnl.gov> wrote:

Hello folks,

I am a newbie, and am running Spark on a small Cloudera CDH 5.5.1 cluster at 
our lab. I am trying to use the PySpark shell for the first time, and am 
attempting to duplicate the documentation example of creating an RDD, which I 
called "lines", using a text file.
I placed a text file c

Re: a basic question on first use of PySpark shell and example, which is failing

2016-02-29 Thread Yin Yang
RDDOperationScope is in spark-core_2.1x jar file.

  7148 Mon Feb 29 09:21:32 PST 2016
org/apache/spark/rdd/RDDOperationScope.class

Can you check whether the spark-core jar is in classpath ?

FYI

On Mon, Feb 29, 2016 at 1:40 PM, Taylor, Ronald C 
wrote:

> Hi Jules, folks,
>
>
>
> I have tried “hdfs://<full filepath>” as well as “file://<full filepath>”.  And several variants. Every time, I get the same msg –
> NoClassDefFoundError. See below. Why do I get such a msg, if the problem is
> simply that Spark cannot find the text file? Doesn’t the error msg indicate
> some other source of the problem?
>
>
>
> I may be missing something in the error report; I am a Java person, not a
> Python programmer.  But doesn’t it look like a call to a Java class
> –something associated with “o9.textFile” -  is failing?  If so, how do I
> fix this?
>
>
>
>   Ron
>
>
>
>
>
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/context.py",
> line 451, in textFile
>
> return RDD(self._jsc.textFile(name, minPartitions), self,
>
>   File
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
> line 538, in __call__
>
>   File
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/sql/utils.py",
> line 36, in deco
>
> return f(*a, **kw)
>
>   File
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
> line 300, in get_return_value
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o9.textFile.
>
> : java.lang.NoClassDefFoundError: Could not initialize class
> org.apache.spark.rdd.RDDOperationScope$
>
>
>
> Ronald C. Taylor, Ph.D.
>
> Computational Biology & Bioinformatics Group
>
> Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
>
> Richland, WA 99352
>
> phone: (509) 372-6568,  email: ronald.tay...@pnnl.gov
>
> web page:  http://www.pnnl.gov/science/staff/staff_info.asp?staff_num=7048
>
>
>
> *From:* Jules Damji [mailto:dmat...@comcast.net]
> *Sent:* Sunday, February 28, 2016 10:07 PM
> *To:* Taylor, Ronald C
> *Cc:* user@spark.apache.org; ronald.taylo...@gmail.com
> *Subject:* Re: a basic question on first use of PySpark shell and
> example, which is failing
>
>
>
>
>
> Hello Ronald,
>
>
>
> Since you have placed the file under HDFS, you might same change the path
> name to:
>
>
>
> val lines = sc.textFile("hdfs://user/taylor/Spark/Warehouse.java")
>
>
> Sent from my iPhone
>
> Pardon the dumb thumb typos :)
>
>
> On Feb 28, 2016, at 9:36 PM, Taylor, Ronald C 
> wrote:
>
>
>
> Hello folks,
>
>
>
> I  am a newbie, and am running Spark on a small Cloudera CDH 5.5.1 cluster
> at our lab. I am trying to use the PySpark shell for the first time. and am
> attempting to  duplicate the documentation example of creating an RDD
> which I called "lines" using a text file.
>
> I placed a a text file called Warehouse.java in this HDFS location:
>
>
> [rtaylor@bigdatann ~]$ hadoop fs -ls /user/rtaylor/Spark
> -rw-r--r--   3 rtaylor supergroup1155355 2016-02-28 18:09
> /user/rtaylor/Spark/Warehouse.java
> [rtaylor@bigdatann ~]$
>
> I then invoked sc.textFile()in the PySpark shell.That did not work. See
> below. Apparently a class is not found? Don't know why that would be the
> case. Any guidance would be very much appreciated.
>
> The Cloudera Manager for the cluster says that Spark is operating  in the
> "green", for whatever that is worth.
>
>  - Ron Taylor
>
>
> >>> lines = sc.textFile("file:///user/taylor/Spark/Warehouse.java")
>
> Traceback (most recent call last):
>   File "", line 1, in 
>   File
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/context.py",
> line 451, in textFile
> return RDD(self._jsc.textFile(name, minPartitions), self,
>   File
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
> line 538, in __call__
>   File
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/sql/utils.py",
> line 36, in deco
> return f(*a, **kw)
>   File
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
> line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o9.textFile.
> : java.lang.NoClassDefFoundError: Could not initialize class
> org.apache.spark.rdd.RDDOperationScope$
> at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
> at org.apache.spark.SparkContext.textFile(SparkContext.scala:825)
> at
> org.apache.spark.api.java.JavaSparkContext.textFile(JavaSparkContext.scala:191)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.

Re: Spark on Windows platform

2016-02-29 Thread Steve Loughran

On 29 Feb 2016, at 13:40, gaurav pathak <gauravpathak...@gmail.com> wrote:


Thanks Jorn.

Any guidance on how to get started with getting SPARK on Windows, is highly 
appreciated.

Thanks & Regards

Gaurav Pathak


you are at risk of seeing stack traces when you try to talk to the local 
filesystem, on account of (a) hadoop being part of the process and (b) it 
needing some native windows binaries

details: https://wiki.apache.org/hadoop/WindowsProblems

those binaries: https://github.com/steveloughran/winutils

(I need to add some 2.7.2 binaries in there, by the look of things)
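
As a sketch only, a common local workaround is to point HADOOP_HOME at a folder 
whose bin\ directory holds winutils.exe before the SparkContext is created; the 
path and app name below are purely placeholders:

import os
from pyspark import SparkConf, SparkContext

# placeholder path: bin\winutils.exe from the repository linked above must live under it
os.environ["HADOOP_HOME"] = "C:\\hadoop"

sc = SparkContext(conf=SparkConf().setAppName("windows-local").setMaster("local[*]"))
print(sc.parallelize(range(10)).count())   # quick sanity check against the local filesystem
sc.stop()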



Re: perl Kafka::Producer, “Kafka::Exception::Producer”, “code”, -1000, “message”, "Invalid argument

2016-02-29 Thread Cody Koeninger
Does this issue involve Spark at all?  Otherwise you may have better luck
on a perl or kafka related list.

On Mon, Feb 29, 2016 at 3:26 PM, Vinti Maheshwari 
wrote:

> Hi All,
>
> I wrote kafka producer using kafka perl api, But i am getting error when i
> am passing variable for sending message while if i am hard coding the
> message data it's not giving any error.
>
> Perl program, where i added kafka producer code:
>
> try {
> $kafka_connection = Kafka::Connection->new( host => 
> $hadoop_server, port => '6667' );
> $producer = Kafka::Producer->new( Connection => 
> $kafka_connection );
> my $topic = 'test1';
> my $partition = 0;
> my $message = $hadoop_str;
> my $response = $producer->send(
> $topic, # topic
> $partition,  # partition
> 
> #"56b4b2b23c24c3608376d1ea,/obj/i386/ui/lib/access/daemon_map.So.gcda,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0"
>  # message
> $hadoop_str
> #"t1,f9,1,1,1"
> );
> } catch {
> my $error = $_;
> if ( blessed( $error ) && $error->isa( 
> 'Kafka::Exception' ) ) {
> warn 'Error: (', $error->code, ') ',  
> $error->message, "\n";
> exit;
> } else {
> die $error;
> }
> };#CCLib::run_system_cmd( $cmd );
> }
>
> Error Log: -bash-3.2$ ./stream_binary_hadoop.pl print (...) interpreted
> as function at ./stream_binary_hadoop.pl line 429. Invalid argument:
> message =
> 56b4b2b23c24c3608376d1ea,/obj/i386/junos/usr.sbin/lmpd/lmpd_repl_msg_idr.gcda,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
> at
> /opt/adp/projects/code_coverage/perl//5.10/lib/site_perl/5.10.1/Exception/Class/Base.pm
> line 85. Exception::Class::Base::throw("Kafka::Exception::Producer",
> "code", -1000, "message", "Invalid argument: message =
> 56b4b2b23c24c3608376d1ea,/obj/i38"...) called at
> /opt/adp/projects/code_coverage/perl//5.10/lib/site_perl/5.10.1/Kafka/Producer.pm
> line 374 Kafka::Producer::_error(Kafka::Producer=HASH(0x36955f8), -1000,
> "message = 56b4b2b23c24c3608376d1ea,/obj/i386/junos/usr.sbin/l"...) called
> at
> /opt/adp/projects/code_coverage/perl//5.10/lib/site_perl/5.10.1/Kafka/Producer.pm
> line 331 Kafka::Producer::send(Kafka::Producer=HASH(0x36955f8), "test1", 0,
> "56b4b2b23c24c3608376d1ea,/obj/i386/junos/usr.sbin/lmpd/lmpd_r"...) called
> at ./stream_binary_hadoop.pl line 175 main::try {...} () called at
> /opt/adp/projects/code_coverage/perl//5.10/lib/site_perl/5.10.1/Try/Tiny.pm
> line 81 eval {...} called at
> /opt/adp/projects/code_coverage/perl//5.10/lib/site_perl/5.10.1/Try/Tiny.pm
> line 72 Try::Tiny::try(CODE(0x3692888), Try::Tiny::Catch=REF(0x3692c78))
> called at ./stream_binary_hadoop.pl line 190
> main::stream(HASH(0x3692708)) called at ./stream_binary_hadoop.pl line
> 354 main::file_split(HASH(0x36927b0)) called at ./stream_binary_hadoop.pl
> line 413
>
> at ./stream_binary_hadoop.pl line 188. main::catch {...} (" Invalid
> argument: message = 56b4b2b23c24c3608376d1e"...) called at
> /opt/adp/projects/code_coverage/perl//5.10/lib/site_perl/5.10.1/Try/Tiny.pm
> line 104 Try::Tiny::try(CODE(0x3692888), Try::Tiny::Catch=REF(0x3692c78))
> called at ./stream_binary_hadoop.pl line 190
> main::stream(HASH(0x3692708)) called at ./stream_binary_hadoop.pl line
> 354 main::file_split(HASH(0x36927b0)) called at ./stream_binary_hadoop.pl
> line 413
>
>
>
> Thank & Regards,
>
> ~Vinti
>


RE: a basic question on first use of PySpark shell and example, which is failing

2016-02-29 Thread Taylor, Ronald C
Hi Jules, folks,

I have tried “hdfs://<full filepath>” as well as “file://<full filepath>”.  And several variants. 
Every time, I get the same msg – NoClassDefFoundError. See below. Why do I get 
such a msg, if the problem is simply that Spark cannot find the text file? 
Doesn’t the error msg indicate some other source of the problem?

I may be missing something in the error report; I am a Java person, not a 
Python programmer.  But doesn’t it look like a call to a Java class –something 
associated with “o9.textFile” -  is failing?  If so, how do I fix this?

  Ron


"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/context.py",
 line 451, in textFile
return RDD(self._jsc.textFile(name, minPartitions), self,
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 538, in __call__
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/sql/utils.py",
 line 36, in deco
return f(*a, **kw)
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
 line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o9.textFile.
: java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.spark.rdd.RDDOperationScope$

Ronald C. Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
Richland, WA 99352
phone: (509) 372-6568,  email: ronald.tay...@pnnl.gov
web page:  http://www.pnnl.gov/science/staff/staff_info.asp?staff_num=7048

From: Jules Damji [mailto:dmat...@comcast.net]
Sent: Sunday, February 28, 2016 10:07 PM
To: Taylor, Ronald C
Cc: user@spark.apache.org; ronald.taylo...@gmail.com
Subject: Re: a basic question on first use of PySpark shell and example, which 
is failing


Hello Ronald,

Since you have placed the file under HDFS, you might want to change the path name
to:

val lines = sc.textFile("hdfs://user/taylor/Spark/Warehouse.java")

Sent from my iPhone
Pardon the dumb thumb typos :)

On Feb 28, 2016, at 9:36 PM, Taylor, Ronald C <ronald.tay...@pnnl.gov> wrote:

Hello folks,

I am a newbie, and am running Spark on a small Cloudera CDH 5.5.1 cluster at 
our lab. I am trying to use the PySpark shell for the first time, and am 
attempting to duplicate the documentation example of creating an RDD, which I 
called "lines", using a text file.
I placed a text file called Warehouse.java in this HDFS location:

[rtaylor@bigdatann ~]$ hadoop fs -ls /user/rtaylor/Spark
-rw-r--r--   3 rtaylor supergroup1155355 2016-02-28 18:09 
/user/rtaylor/Spark/Warehouse.java
[rtaylor@bigdatann ~]$

I then invoked sc.textFile() in the PySpark shell. That did not work. See below. 
Apparently a class is not found? I don't know why that would be the case. Any 
guidance would be very much appreciated.
The Cloudera Manager for the cluster says that Spark is operating  in the 
"green", for whatever that is worth.
 - Ron Taylor

>>> lines = 
>>> sc.textFile("file:///user/taylor/Spark/Warehouse.java")

Traceback (most recent call last):
  File "", line 1, in 
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/context.py",
 line 451, in textFile
return RDD(self._jsc.textFile(name, minPartitions), self,
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 538, in __call__
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/sql/utils.py",
 line 36, in deco
return f(*a, **kw)
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
 line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o9.textFile.
: java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.spark.rdd.RDDOperationScope$
at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
at org.apache.spark.SparkContext.textFile(SparkContext.scala:825)
at 
org.apache.spark.api.java.JavaSparkContext.textFile(JavaSparkContext.scala:191)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)

>>>


perl Kafka::Producer, “Kafka::Exception::Producer”, “code”, -1000, “message”, "Invalid argument

2016-02-29 Thread Vinti Maheshwari
Hi All,

I wrote a Kafka producer using the Kafka Perl API, but I am getting an error when
I pass a variable as the message, while if I hard-code the message data it does
not give any error.

Perl program where I added the Kafka producer code:

try {
$kafka_connection = Kafka::Connection->new(
host => $hadoop_server, port => '6667' );
$producer = Kafka::Producer->new( Connection
=> $kafka_connection );
my $topic = 'test1';
my $partition = 0;
my $message = $hadoop_str;
my $response = $producer->send(
$topic, # topic
$partition,  # partition

#"56b4b2b23c24c3608376d1ea,/obj/i386/ui/lib/access/daemon_map.So.gcda,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0"
# message
$hadoop_str
#"t1,f9,1,1,1"
);
} catch {
my $error = $_;
if ( blessed( $error ) && $error->isa(
'Kafka::Exception' ) ) {
warn 'Error: (', $error->code, ') ',
$error->message, "\n";
exit;
} else {
die $error;
}
};#CCLib::run_system_cmd( $cmd );
}

Error Log: -bash-3.2$ ./stream_binary_hadoop.pl print (...) interpreted as
function at ./stream_binary_hadoop.pl line 429. Invalid argument: message =
56b4b2b23c24c3608376d1ea,/obj/i386/junos/usr.sbin/lmpd/lmpd_repl_msg_idr.gcda,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
at
/opt/adp/projects/code_coverage/perl//5.10/lib/site_perl/5.10.1/Exception/Class/Base.pm
line 85. Exception::Class::Base::throw("Kafka::Exception::Producer",
"code", -1000, "message", "Invalid argument: message =
56b4b2b23c24c3608376d1ea,/obj/i38"...) called at
/opt/adp/projects/code_coverage/perl//5.10/lib/site_perl/5.10.1/Kafka/Producer.pm
line 374 Kafka::Producer::_error(Kafka::Producer=HASH(0x36955f8), -1000,
"message = 56b4b2b23c24c3608376d1ea,/obj/i386/junos/usr.sbin/l"...) called
at
/opt/adp/projects/code_coverage/perl//5.10/lib/site_perl/5.10.1/Kafka/Producer.pm
line 331 Kafka::Producer::send(Kafka::Producer=HASH(0x36955f8), "test1", 0,
"56b4b2b23c24c3608376d1ea,/obj/i386/junos/usr.sbin/lmpd/lmpd_r"...) called
at ./stream_binary_hadoop.pl line 175 main::try {...} () called at
/opt/adp/projects/code_coverage/perl//5.10/lib/site_perl/5.10.1/Try/Tiny.pm
line 81 eval {...} called at
/opt/adp/projects/code_coverage/perl//5.10/lib/site_perl/5.10.1/Try/Tiny.pm
line 72 Try::Tiny::try(CODE(0x3692888), Try::Tiny::Catch=REF(0x3692c78))
called at ./stream_binary_hadoop.pl line 190 main::stream(HASH(0x3692708))
called at ./stream_binary_hadoop.pl line 354
main::file_split(HASH(0x36927b0)) called at ./stream_binary_hadoop.pl line
413

at ./stream_binary_hadoop.pl line 188. main::catch {...} (" Invalid
argument: message = 56b4b2b23c24c3608376d1e"...) called at
/opt/adp/projects/code_coverage/perl//5.10/lib/site_perl/5.10.1/Try/Tiny.pm
line 104 Try::Tiny::try(CODE(0x3692888), Try::Tiny::Catch=REF(0x3692c78))
called at ./stream_binary_hadoop.pl line 190 main::stream(HASH(0x3692708))
called at ./stream_binary_hadoop.pl line 354
main::file_split(HASH(0x36927b0)) called at ./stream_binary_hadoop.pl line
413



Thanks & Regards,

~Vinti


Re: Spark Integration Patterns

2016-02-29 Thread Alexander Pivovarov
There is spark-jobserver (SJS), which is a REST interface for Spark and
Spark SQL. You can deploy a jar file with your job implementations to
spark-jobserver and use its REST API to submit jobs in sync or async mode;
in async mode you need to poll SJS to get the job result. The job result
might be the actual data in JSON, or a path on S3/HDFS pointing to the data.

There are instructions on how to start job-server on AWS EMR and submit a
simple wordcount job using curl:
https://github.com/spark-jobserver/spark-jobserver/blob/master/doc/EMR.md
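
As a rough illustration, calling SJS from a plain Python client could look like
the sketch below. The host, jar name, appName and classPath are placeholders;
8090 is spark-jobserver's default port, and the /jars and /jobs endpoints are
the ones described in the spark-jobserver docs, so double-check them against
the version you deploy.

import requests

SJS = "http://sjs-host:8090"   # placeholder host; 8090 is the default SJS port

# upload the jar that contains the job implementation (placeholder names)
with open("my-spark-jobs.jar", "rb") as jar:
    requests.post(SJS + "/jars/my-app", data=jar.read())

# run a job synchronously and read the result back as JSON
resp = requests.post(
    SJS + "/jobs",
    params={"appName": "my-app", "classPath": "com.example.MyJob", "sync": "true"},
    data="input.path = s3://some-bucket/some-data")   # job config (Typesafe config syntax)
print(resp.json())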

On Mon, Feb 29, 2016 at 12:54 PM, skaarthik oss 
wrote:

> Check out http://toree.incubator.apache.org/. It might help with your
> need.
>
>
>
> *From:* moshir mikael [mailto:moshir.mik...@gmail.com]
> *Sent:* Monday, February 29, 2016 5:58 AM
> *To:* Alex Dzhagriev 
> *Cc:* user 
> *Subject:* Re: Spark Integration Patterns
>
>
>
> Thanks, will check too, however : just want to use Spark core RDD and
> standard data sources.
>
>
>
> On Mon, 29 Feb 2016 at 14:54, Alex Dzhagriev  wrote:
>
> Hi Moshir,
>
>
>
> Regarding the streaming, you can take a look at the spark streaming, the
> micro-batching framework. If it satisfies your needs it has a bunch of
> integrations. Thus, the source for the jobs could be Kafka, Flume or Akka.
>
>
>
> Cheers, Alex.
>
>
>
> On Mon, Feb 29, 2016 at 2:48 PM, moshir mikael 
> wrote:
>
> Hi Alex,
>
> thanks for the link. Will check it.
>
> Does someone know of a more streamlined approach ?
>
>
>
>
>
>
>
> On Mon, 29 Feb 2016 at 10:28, Alex Dzhagriev  wrote:
>
> Hi Moshir,
>
>
>
> I think you can use the rest api provided with Spark:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala
>
>
>
> Unfortunately, I haven't find any documentation, but it looks fine.
>
> Thanks, Alex.
>
>
>
> On Sun, Feb 28, 2016 at 3:25 PM, mms  wrote:
>
> Hi, I cannot find a simple example showing how a typical application can
> 'connect' to a remote spark cluster and interact with it. Let's say I have
> a Python web application hosted somewhere *outside *a spark cluster, with
> just python installed on it. How can I talk to Spark without using a
> notebook, or using ssh to connect to a cluster master node ? I know of
> spark-submit and spark-shell, however forking a process on a remote host to
> execute a shell script seems like a lot of effort. What are the recommended
> ways to connect to and query Spark from a remote client? Thanks!
>
>
>
>
>
>


Re: Flattening Data within DataFrames

2016-02-29 Thread Kevin Mellott
Thanks Michal - this is exactly what I need.

On Mon, Feb 29, 2016 at 11:40 AM, Michał Zieliński <
zielinski.mich...@gmail.com> wrote:

> Hi Kevin,
>
> This should help:
>
> https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-spark.html
>
> On 29 February 2016 at 16:54, Kevin Mellott 
> wrote:
>
>> Fellow Sparkers,
>>
>> I'm trying to "flatten" my view of data within a DataFrame, and am having
>> difficulties doing so. The DataFrame contains product information, which
>> includes multiple levels of categories (primary, secondary, etc).
>>
>> *Example Data (Raw):*
>> *NameLevelCategory*
>> Baked CodeFood 1
>> Baked CodeSeafood 2
>> Baked CodeFish   3
>> Hockey Stick  Sports1
>> Hockey Stick  Hockey  2
>> Hockey Stick  Equipment  3
>>
>> *Desired Data:*
>> *NameCategory1 Category2 Category3*
>> Baked CodeFood  Seafood Fish
>> Hockey Stick  SportsHockey  Equipment
>>
>> *Approach:*
>> After parsing the "raw" information into two separate DataFrames (called 
>> *products
>> *and *categories*) and registering them as a Spark SQL tables, I was
>> attempting to perform the following query to flatten this all into the
>> "desired data" (depicted above).
>>
>> products.registerTempTable("products")
>> categories.registerTempTable("categories")
>>
>> val productList = sqlContext.sql(
>>   " SELECT p.Name, " +
>>   " c1.Description AS Category1, " +
>>   " c2.Description AS Category2, " +
>>   " c3.Description AS Category3 " +
>>   " FROM products AS p " +
>>   "   JOIN categories AS c1 " +
>>   " ON c1.Name = p.Name AND c1.Level = '1' "
>>   "   JOIN categories AS c2 " +
>>   " ON c2.Name = p.Name AND c2.Level = '2' "
>>   "   JOIN categories AS c3 " +
>>   " ON c3.Name = p.Name AND c3.Level = '3' "
>>
>> *Issue:*
>> I get an error when running my query above, because I am not able to JOIN
>> the *categories* table more than once. Has anybody dealt with this type
>> of use case before, and if so how did you achieve the desired behavior?
>>
>> Thank you in advance for your thoughts.
>>
>> Kevin
>>
>
>


RE: Spark Integration Patterns

2016-02-29 Thread skaarthik oss
Check out http://toree.incubator.apache.org/. It might help with your need.

 

From: moshir mikael [mailto:moshir.mik...@gmail.com] 
Sent: Monday, February 29, 2016 5:58 AM
To: Alex Dzhagriev 
Cc: user 
Subject: Re: Spark Integration Patterns

 

Thanks, will check too, however : just want to use Spark core RDD and standard 
data sources.

 

On Mon, 29 Feb 2016 at 14:54, Alex Dzhagriev <dzh...@gmail.com> wrote:

Hi Moshir,

 

Regarding the streaming, you can take a look at the spark streaming, the 
micro-batching framework. If it satisfies your needs it has a bunch of 
integrations. Thus, the source for the jobs could be Kafka, Flume or Akka.

 

Cheers, Alex.

 

On Mon, Feb 29, 2016 at 2:48 PM, moshir mikael <moshir.mik...@gmail.com> wrote:

Hi Alex,

thanks for the link. Will check it.

Does someone know of a more streamlined approach ?

 

 

 

On Mon, 29 Feb 2016 at 10:28, Alex Dzhagriev <dzh...@gmail.com> wrote:

Hi Moshir,

 

I think you can use the rest api provided with Spark: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala

 

Unfortunately, I haven't found any documentation, but it looks fine.

Thanks, Alex.

 

On Sun, Feb 28, 2016 at 3:25 PM, mms <moshir.mik...@gmail.com> wrote:

Hi, I cannot find a simple example showing how a typical application can 
'connect' to a remote spark cluster and interact with it. Let's say I have a 
Python web application hosted somewhere outside a spark cluster, with just 
python installed on it. How can I talk to Spark without using a notebook, or 
using ssh to connect to a cluster master node? I know of spark-submit and 
spark-shell, however forking a process on a remote host to execute a shell 
script seems like a lot of effort. What are the recommended ways to connect to 
and query Spark from a remote client? Thanks!


 

 



Re: Spark for client

2016-02-29 Thread Minudika Malshan
+Adding resources
https://zeppelin.incubator.apache.org/docs/latest/interpreter/spark.html
https://zeppelin.incubator.apache.org

Minudika Malshan
Undergraduate
Department of Computer Science and Engineering
University of Moratuwa.
*Mobile : +94715659887*
*LinkedIn* : https://lk.linkedin.com/in/minudika



On Tue, Mar 1, 2016 at 12:51 AM, Minudika Malshan 
wrote:

> Hi,
>
> I think zeppelin spark interpreter will give a solution to your problem.
>
> Regards.
> Minudika
>
> Minudika Malshan
> Undergraduate
> Department of Computer Science and Engineering
> University of Moratuwa.
> *Mobile : +94715659887*
> *LinkedIn* : https://lk.linkedin.com/in/minudika
>
>
>
> On Tue, Mar 1, 2016 at 12:35 AM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
>> Zeppelin?
>>
>> Regards
>> Sab
>> On 01-Mar-2016 12:27 am, "Mich Talebzadeh" 
>> wrote:
>>
>>> Hi,
>>>
>>> Is there such thing as Spark for client much like RDBMS client that have
>>> cut down version of their big brother useful for client connectivity but
>>> cannot be used as server.
>>>
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>
>


Re: Spark for client

2016-02-29 Thread Minudika Malshan
Hi,

I think zeppelin spark interpreter will give a solution to your problem.

Regards.
Minudika

Minudika Malshan
Undergraduate
Department of Computer Science and Engineering
University of Moratuwa.
*Mobile : +94715659887*
*LinkedIn* : https://lk.linkedin.com/in/minudika



On Tue, Mar 1, 2016 at 12:35 AM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> Zeppelin?
>
> Regards
> Sab
> On 01-Mar-2016 12:27 am, "Mich Talebzadeh" 
> wrote:
>
>> Hi,
>>
>> Is there such thing as Spark for client much like RDBMS client that have
>> cut down version of their big brother useful for client connectivity but
>> cannot be used as server.
>>
>> Thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>


Re: Spark for client

2016-02-29 Thread Sabarish Sasidharan
Zeppelin?

Regards
Sab
On 01-Mar-2016 12:27 am, "Mich Talebzadeh" 
wrote:

> Hi,
>
> Is there such thing as Spark for client much like RDBMS client that have
> cut down version of their big brother useful for client connectivity but
> cannot be used as server.
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>


Spark for client

2016-02-29 Thread Mich Talebzadeh
Hi,

Is there such a thing as Spark for clients, much like RDBMS clients that are
cut-down versions of their big brother, useful for client connectivity but that
cannot be used as a server?

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


Re: LDA topic Modeling spark + python

2016-02-29 Thread Bryan Cutler
The input into LDA.train needs to be an RDD of a list with the first
element an integer (id) and the second a pyspark.mllib.Vector object
containing real numbers (term counts), i.e. an RDD of [doc_id,
vector_of_counts].

From your example, it looks like your corpus is a list with a zero-based
id, with the second element a tuple of user id and list of lines from the
data that have that user_id, something like [doc_id, (user_id, [line0,
line1])]

You need to make that element a Vector containing real numbers somehow.
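
A rough sketch of one way to get there, continuing from the pairs RDD in the
code quoted below. HashingTF is just an assumed shortcut for producing
term-count vectors (any bag-of-words counting would do), and the feature count
is an arbitrary choice:

from pyspark.mllib.clustering import LDA
from pyspark.mllib.feature import HashingTF

# assumes `pairs` is an RDD of (user_id, status_line), as in the quoted code
grouped = pairs.groupByKey()                             # (user_id, iterable of lines)
docs = grouped.map(lambda kv: " ".join(kv[1]).split())   # one token list per user

hashing_tf = HashingTF(2000)        # 2000 features is an arbitrary vocabulary size
vectors = hashing_tf.transform(docs)                     # RDD of term-count Vectors

corpus = vectors.zipWithIndex().map(lambda vi: [vi[1], vi[0]]).cache()  # [doc_id, vector]
model = LDA.train(corpus, k=10, maxIterations=10)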

On Sun, Feb 28, 2016 at 11:08 PM, Mishra, Abhishek <
abhishek.mis...@xerox.com> wrote:

> Hello Bryan,
>
>
>
> Thank you for the update on Jira. I took your code and tried with mine.
> But I get an error with the vector being created. Please see my code below
> and suggest me.
>
> My input file has some contents like this:
>
> "user_id","status"
>
> "0026c10bbbc7eeb55a61ab696ca93923","http:
> www.youtube.com//watch?v=n3nPiBai66M&feature=related **bobsnewline**
> tiftakar, Trudy Darmanin  <3?"
>
> "0026c10bbbc7eeb55a61ab696ca93923","Brandon Cachia ,All I know is
> that,you're so nice."
>
> "0026c10bbbc7eeb55a61ab696ca93923","Melissa Zejtunija:HAM AND CHEESE BIEX
> INI??? **bobsnewline**  Kirr:bit tigieg mel **bobsnewline**  Melissa
> Zejtunija :jaq le mandix aptit tigieg **bobsnewline**  Kirr:int bis
> serjeta?"
>
> "0026c10bbbc7eeb55a61ab696ca93923",".Where is my mind?"
>
>
>
> And what I am doing in my code is like this:
>
>
>
> import string
>
> from pyspark.sql import SQLContext
>
> from pyspark import SparkConf, SparkContext
>
> from pyspark.sql import SQLContext
>
> from pyspark.mllib.clustering import LDA, LDAModel
>
> from nltk.tokenize import word_tokenize
>
> from stop_words import get_stop_words
>
> from nltk.stem.porter import PorterStemmer
>
> from gensim import corpora, models
>
> import gensim
>
> import textmining
>
> import pandas as pd
>
> conf = SparkConf().setAppName("building a warehouse")
>
> sc = SparkContext(conf=conf)
>
> sql_sc = SQLContext(sc)
>
> data = sc.textFile('file:///home/cloudera/LDA-Model/Pyspark/test1.csv')
>
> header = data.first() #extract header
>
> print header
>
> data = data.filter(lambda x:x !=header)#filter out header
>
> pairs = data.map(lambda x: (x.split(',')[0], x))#.collect()#generate pair
> rdd key value
>
> #data11=data.subtractByKey(header)
>
> #print pairs.collect()
>
> #grouped=pairs.map(lambda (x,y): (x, [y])).reduceByKey(lambda a, b: a + b)
>
> grouped=pairs.groupByKey()#grouping values as per key
>
> #print grouped.collectAsMap()
>
> grouped_val= grouped.map(lambda x : (list(x[1]))).collect()
>
> #rr=grouped_val.map(lambda (x,y):(x,[y]))
>
> #df_grouped_val=sql_sc.createDataFrame(rr, ["user_id", "status"])
>
> #print list(enumerate(grouped_val))
>
> #corpus = grouped.zipWithIndex().map(lambda x: [x[1],
> x[0]]).cache()#.collect()
>
> corpus = grouped.zipWithIndex().map(lambda (term_counts, doc_id): [doc_id,
> term_counts]).cache()
>
> #corpus.cache()
>
> model = LDA.train(corpus, k=10, maxIterations=10, optimizer="online")
>
> #ldaModel = LDA.train(corpus, k=3)
>
> print corpus
>
> topics = model.describeTopics(3)
>
> print("\"topic\", \"termIndices\", \"termWeights\"")
>
> for i, t in enumerate(topics):
>
>print("%d, %s, %s" % (i, str(t[0]), str(t[1])))
>
>
>
> sc.stop()
>
>
>
>
>
> Please help me in this
>
> Abhishek
>
>
>
> *From:* Bryan Cutler [mailto:cutl...@gmail.com]
> *Sent:* Friday, February 26, 2016 4:17 AM
> *To:* Mishra, Abhishek
> *Cc:* user@spark.apache.org
> *Subject:* Re: LDA topic Modeling spark + python
>
>
>
> I'm not exactly sure how you would like to setup your LDA model, but I
> noticed there was no Python example for LDA in Spark.  I created this issue
> to add it https://issues.apache.org/jira/browse/SPARK-13500.  Keep an eye
> on this if it could be of help.
>
> bryan
>
>
>
> On Wed, Feb 24, 2016 at 8:34 PM, Mishra, Abhishek <
> abhishek.mis...@xerox.com> wrote:
>
> Hello All,
>
>
>
> If someone has any leads on this please help me.
>
>
>
> Sincerely,
>
> Abhishek
>
>
>
> *From:* Mishra, Abhishek
> *Sent:* Wednesday, February 24, 2016 5:11 PM
> *To:* user@spark.apache.org
> *Subject:* LDA topic Modeling spark + python
>
>
>
> Hello All,
>
>
>
>
>
> I am doing a LDA model, please guide me with something.
>
>
>
> I have a csv file which has two column "user_id" and "status". I have to
> generate a word-topic distribution after aggregating the user_id. Meaning
> to say I need to model it for users on their grouped status. The topic
> length being 2000 and value of k or number of words being 3.
>
>
>
> Please, if you can provide me with some link or some code base on spark
> with python ; I would be grateful.
>
>
>
>
>
> Looking forward for a  reply,
>
>
>
> Sincerely,
>
> Abhishek
>
>
>
>
>


Re: Spark 1.5 on Mesos

2016-02-29 Thread Sathish Kumaran Vairavelu
Maybe the Mesos executor couldn't find the Spark image, or the constraints are
not satisfied. Check your Mesos UI to see if the Spark application appears in
the Frameworks tab.
On Mon, Feb 29, 2016 at 12:23 PM Ashish Soni  wrote:

> What is the Best practice , I have everything running as docker container
> in single host ( mesos and marathon also as docker container )  and
> everything comes up fine but when i try to launch the spark shell i get
> below error
>
>
> SQL context available as sqlContext.
>
> scala> val data = sc.parallelize(1 to 100)
> data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at
> parallelize at :27
>
> scala> data.count
> [Stage 0:>  (0 +
> 0) / 2]16/02/29 18:21:12 WARN TaskSchedulerImpl: Initial job has not
> accepted any resources; check your cluster UI to ensure that workers are
> registered and have sufficient resources
> 16/02/29 18:21:27 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient resources
>
>
>
> On Mon, Feb 29, 2016 at 12:04 PM, Tim Chen  wrote:
>
>> No you don't have to run Mesos in docker containers to run Spark in
>> docker containers.
>>
>> Once you have Mesos cluster running you can then specfiy the Spark
>> configurations in your Spark job (i.e: 
>> spark.mesos.executor.docker.image=mesosphere/spark:1.6)
>> and Mesos will automatically launch docker containers for you.
>>
>> Tim
>>
>> On Mon, Feb 29, 2016 at 7:36 AM, Ashish Soni 
>> wrote:
>>
>>> Yes i read that and not much details here.
>>>
>>> Is it true that we need to have spark installed on each mesos docker
>>> container ( master and slave ) ...
>>>
>>> Ashish
>>>
>>> On Fri, Feb 26, 2016 at 2:14 PM, Tim Chen  wrote:
>>>
 https://spark.apache.org/docs/latest/running-on-mesos.html should be
 the best source, what problems were you running into?

 Tim

 On Fri, Feb 26, 2016 at 11:06 AM, Yin Yang  wrote:

> Have you read this ?
> https://spark.apache.org/docs/latest/running-on-mesos.html
>
> On Fri, Feb 26, 2016 at 11:03 AM, Ashish Soni 
> wrote:
>
>> Hi All ,
>>
>> Is there any proper documentation as how to run spark on mesos , I am
>> trying from the last few days and not able to make it work.
>>
>> Please help
>>
>> Ashish
>>
>
>

>>>
>>
>


Re: Unresolved dep when building project with spark 1.6

2016-02-29 Thread Josh Rosen
Have you tried removing the leveldbjni files from your local ivy cache? My
hunch is that this is a problem with some local cache state rather than the
dependency simply being unavailable / not existing (note that the error
message was "origin location must be absolute:[...]", not that the files
couldn't be found).

On Mon, Feb 29, 2016 at 2:19 AM Hao Ren  wrote:

> Hi,
>
> I am upgrading my project to spark 1.6.
> It seems that the deps are broken.
>
> Deps used in sbt
>
> val scalaVersion = "2.10"
> val sparkVersion  = "1.6.0"
> val hadoopVersion = "2.7.1"
>
> // Libraries
> val scalaTest = "org.scalatest" %% "scalatest" % "2.2.4" % "test"
> val sparkSql  = "org.apache.spark" %% "spark-sql" % sparkVersion
> val sparkML   = "org.apache.spark" %% "spark-mllib" % sparkVersion
> val hadoopAWS = "org.apache.hadoop" % "hadoop-aws" % hadoopVersion
> val scopt = "com.github.scopt" %% "scopt" % "3.3.0"
> val jodacvt   = "org.joda" % "joda-convert" % "1.8.1"
>
> Sbt exception:
>
> [warn] ::
> [warn] ::  UNRESOLVED DEPENDENCIES ::
> [warn] ::
> [warn] :: org.fusesource.leveldbjni#leveldbjni-all;1.8:
> org.fusesource.leveldbjni#leveldbjni-all;1.8!leveldbjni-all.pom(pom.original)
> origin location must be absolute:
> file:/home/invkrh/.m2/repository/org/fusesource/leveldbjni/leveldbjni-all/1.8/leveldbjni-all-1.8.pom
> [warn] ::
> [warn]
> [warn] Note: Unresolved dependencies path:
> [warn] org.fusesource.leveldbjni:leveldbjni-all:1.8
> [warn]  +- org.apache.spark:spark-network-shuffle_2.10:1.6.0
> [warn]  +- org.apache.spark:spark-core_2.10:1.6.0
> [warn]  +- org.apache.spark:spark-catalyst_2.10:1.6.0
> [warn]  +- org.apache.spark:spark-sql_2.10:1.6.0
> (/home/invkrh/workspace/scala/confucius/botdet/build.sbt#L14)
> [warn]  +- org.apache.spark:spark-mllib_2.10:1.6.0
> (/home/invkrh/workspace/scala/confucius/botdet/build.sbt#L14)
> [warn]  +- fr.leboncoin:botdet_2.10:0.1
> sbt.ResolveException: unresolved dependency:
> org.fusesource.leveldbjni#leveldbjni-all;1.8:
> org.fusesource.leveldbjni#leveldbjni-all;1.8!leveldbjni-all.pom(pom.original)
> origin location must be absolute:
> file:/home/invkrh/.m2/repository/org/fusesource/leveldbjni/leveldbjni-all/1.8/leveldbjni-all-1.8.pom
>
> Thank you.
>
> --
> Hao Ren
>
> Data Engineer @ leboncoin
>
> Paris, France
>


Re: Spark 1.5 on Mesos

2016-02-29 Thread Ashish Soni
What is the best practice? I have everything running as Docker containers on a
single host (Mesos and Marathon also as Docker containers), and everything comes
up fine, but when I try to launch the spark shell I get the error below:


SQL context available as sqlContext.

scala> val data = sc.parallelize(1 to 100)
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at
parallelize at :27

scala> data.count
[Stage 0:>  (0 + 0)
/ 2]16/02/29 18:21:12 WARN TaskSchedulerImpl: Initial job has not accepted
any resources; check your cluster UI to ensure that workers are registered
and have sufficient resources
16/02/29 18:21:27 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient resources



On Mon, Feb 29, 2016 at 12:04 PM, Tim Chen  wrote:

> No you don't have to run Mesos in docker containers to run Spark in docker
> containers.
>
> Once you have Mesos cluster running you can then specfiy the Spark
> configurations in your Spark job (i.e: 
> spark.mesos.executor.docker.image=mesosphere/spark:1.6)
> and Mesos will automatically launch docker containers for you.
>
> Tim
>
> On Mon, Feb 29, 2016 at 7:36 AM, Ashish Soni 
> wrote:
>
>> Yes i read that and not much details here.
>>
>> Is it true that we need to have spark installed on each mesos docker
>> container ( master and slave ) ...
>>
>> Ashish
>>
>> On Fri, Feb 26, 2016 at 2:14 PM, Tim Chen  wrote:
>>
>>> https://spark.apache.org/docs/latest/running-on-mesos.html should be
>>> the best source, what problems were you running into?
>>>
>>> Tim
>>>
>>> On Fri, Feb 26, 2016 at 11:06 AM, Yin Yang  wrote:
>>>
 Have you read this ?
 https://spark.apache.org/docs/latest/running-on-mesos.html

 On Fri, Feb 26, 2016 at 11:03 AM, Ashish Soni 
 wrote:

> Hi All ,
>
> Is there any proper documentation as how to run spark on mesos , I am
> trying from the last few days and not able to make it work.
>
> Please help
>
> Ashish
>


>>>
>>
>


Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-02-29 Thread Koert Kuipers
Setting spark.shuffle.reduceLocality.enabled=false worked for me, thanks.

Is there any reference to the benefits of setting reduceLocality to true? I am
tempted to disable it across the board.
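
For reference, this is an ordinary Spark conf key, so (as a sketch only) it can
go in spark-defaults.conf, be passed with --conf at submit time, or be set on
the SparkConf directly; the app name below is a placeholder:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("my-app")                                    # placeholder name
        .set("spark.shuffle.reduceLocality.enabled", "false"))   # disable reduce-task locality preferences
sc = SparkContext(conf=conf)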

On Mon, Feb 29, 2016 at 9:51 AM, Yin Yang  wrote:

> The default value for spark.shuffle.reduceLocality.enabled is true.
>
> To reduce surprise to users of 1.5 and earlier releases, should the
> default value be set to false ?
>
> On Mon, Feb 29, 2016 at 5:38 AM, Lior Chaga  wrote:
>
>> Hi Koret,
>> Try spark.shuffle.reduceLocality.enabled=false
>> This is an undocumented configuration.
>> See:
>> https://github.com/apache/spark/pull/8280
>> https://issues.apache.org/jira/browse/SPARK-10567
>>
>> It solved the problem for me (both with and without memory legacy mode)
>>
>>
>> On Sun, Feb 28, 2016 at 11:16 PM, Koert Kuipers 
>> wrote:
>>
>>> i find it particularly confusing that a new memory management module
>>> would change the locations. its not like the hash partitioner got replaced.
>>> i can switch back and forth between legacy and "new" memory management and
>>> see the distribution change... fully reproducible
>>>
>>> On Sun, Feb 28, 2016 at 11:24 AM, Lior Chaga  wrote:
>>>
 Hi,
 I've experienced a similar problem upgrading from spark 1.4 to spark
 1.6.
 The data is not evenly distributed across executors, but in my case it
 also reproduced with legacy mode.
 Also tried 1.6.1 rc-1, with same results.

 Still looking for resolution.

 Lior

 On Fri, Feb 19, 2016 at 2:01 AM, Koert Kuipers 
 wrote:

> looking at the cached rdd i see a similar story:
> with useLegacyMode = true the cached rdd is spread out across 10
> executors, but with useLegacyMode = false the data for the cached rdd sits
> on only 3 executors (the rest all show 0s). my cached RDD is a key-value
> RDD that got partitioned (hash partitioner, 50 partitions) before being
> cached.
>
> On Thu, Feb 18, 2016 at 6:51 PM, Koert Kuipers 
> wrote:
>
>> hello all,
>> we are just testing a semi-realtime application (it should return
>> results in less than 20 seconds from cached RDDs) on spark 1.6.0. before
>> this it used to run on spark 1.5.1
>>
>> in spark 1.6.0 the performance is similar to 1.5.1 if i set
>> spark.memory.useLegacyMode = true, however if i switch to
>> spark.memory.useLegacyMode = false the queries take about 50% to 100% 
>> more
>> time.
>>
>> the issue becomes clear when i focus on a single stage: the
>> individual tasks are not slower at all, but they run on less executors.
>> in my test query i have 50 tasks and 10 executors. both with
>> useLegacyMode = true and useLegacyMode = false the tasks finish in about 
>> 3
>> seconds and show as running PROCESS_LOCAL. however when  useLegacyMode =
>> false the tasks run on just 3 executors out of 10, while with 
>> useLegacyMode
>> = true they spread out across 10 executors. all the tasks running on 
>> just a
>> few executors leads to the slower results.
>>
>> any idea why this would happen?
>> thanks! koert
>>
>>
>>
>

>>>
>>
>


Re: Flattening Data within DataFrames

2016-02-29 Thread Michał Zieliński
Hi Kevin,

This should help:
https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-spark.html
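
For the example data in this thread, the pivot approach from that post would
look roughly like the sketch below (written in PySpark; the Scala DataFrame API
has the same pivot method in 1.6, and the hard-coded rows and level values are
just assumptions mirroring the question):

from pyspark.sql import SQLContext
from pyspark.sql.functions import col, first

sqlContext = SQLContext(sc)   # assumes an existing SparkContext `sc`

# hypothetical rows mirroring the "raw" layout from the question
raw = sqlContext.createDataFrame(
    [("Baked Code", "1", "Food"), ("Baked Code", "2", "Seafood"), ("Baked Code", "3", "Fish"),
     ("Hockey Stick", "1", "Sports"), ("Hockey Stick", "2", "Hockey"), ("Hockey Stick", "3", "Equipment")],
    ["Name", "Level", "Category"])

# one row per Name, one column per Level (pivot was added in Spark 1.6)
flat = (raw.groupBy("Name")
           .pivot("Level", ["1", "2", "3"])
           .agg(first("Category"))
           .select(col("Name"), col("1").alias("Category1"),
                   col("2").alias("Category2"), col("3").alias("Category3")))
flat.show()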

On 29 February 2016 at 16:54, Kevin Mellott 
wrote:

> Fellow Sparkers,
>
> I'm trying to "flatten" my view of data within a DataFrame, and am having
> difficulties doing so. The DataFrame contains product information, which
> includes multiple levels of categories (primary, secondary, etc).
>
> *Example Data (Raw):*
> *Name          Level    Category*
> Baked Code     1        Food
> Baked Code     2        Seafood
> Baked Code     3        Fish
> Hockey Stick   1        Sports
> Hockey Stick   2        Hockey
> Hockey Stick   3        Equipment
>
> *Desired Data:*
> *Name          Category1    Category2    Category3*
> Baked Code     Food         Seafood      Fish
> Hockey Stick   Sports       Hockey       Equipment
>
> *Approach:*
> After parsing the "raw" information into two separate DataFrames (called 
> *products
> *and *categories*) and registering them as a Spark SQL tables, I was
> attempting to perform the following query to flatten this all into the
> "desired data" (depicted above).
>
> products.registerTempTable("products")
> categories.registerTempTable("categories")
>
> val productList = sqlContext.sql(
>   " SELECT p.Name, " +
>   " c1.Description AS Category1, " +
>   " c2.Description AS Category2, " +
>   " c3.Description AS Category3 " +
>   " FROM products AS p " +
>   "   JOIN categories AS c1 " +
>   " ON c1.Name = p.Name AND c1.Level = '1' "
>   "   JOIN categories AS c2 " +
>   " ON c2.Name = p.Name AND c2.Level = '2' "
>   "   JOIN categories AS c3 " +
>   " ON c3.Name = p.Name AND c3.Level = '3' "
>
> *Issue:*
> I get an error when running my query above, because I am not able to JOIN
> the *categories* table more than once. Has anybody dealt with this type
> of use case before, and if so how did you achieve the desired behavior?
>
> Thank you in advance for your thoughts.
>
> Kevin
>


Re: Spark 1.5 on Mesos

2016-02-29 Thread Tim Chen
No you don't have to run Mesos in docker containers to run Spark in docker
containers.

Once you have a Mesos cluster running, you can then specify the Spark
configurations in your Spark job (i.e.:
spark.mesos.executor.docker.image=mesosphere/spark:1.6)
and Mesos will automatically launch docker containers for you.
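
As a minimal sketch (the master URL and app name below are placeholders, not
something specific to your setup):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("mesos-docker-example")            # placeholder name
        .setMaster("mesos://mesos-master:5050")        # placeholder Mesos master URL
        .set("spark.mesos.executor.docker.image", "mesosphere/spark:1.6"))
sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).count())              # executors should come up as docker containers
sc.stop()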

Tim

On Mon, Feb 29, 2016 at 7:36 AM, Ashish Soni  wrote:

> Yes i read that and not much details here.
>
> Is it true that we need to have spark installed on each mesos docker
> container ( master and slave ) ...
>
> Ashish
>
> On Fri, Feb 26, 2016 at 2:14 PM, Tim Chen  wrote:
>
>> https://spark.apache.org/docs/latest/running-on-mesos.html should be the
>> best source, what problems were you running into?
>>
>> Tim
>>
>> On Fri, Feb 26, 2016 at 11:06 AM, Yin Yang  wrote:
>>
>>> Have you read this ?
>>> https://spark.apache.org/docs/latest/running-on-mesos.html
>>>
>>> On Fri, Feb 26, 2016 at 11:03 AM, Ashish Soni 
>>> wrote:
>>>
 Hi All ,

 Is there any proper documentation as how to run spark on mesos , I am
 trying from the last few days and not able to make it work.

 Please help

 Ashish

>>>
>>>
>>
>


Optimizing cartesian product using keys

2016-02-29 Thread eahlberg
Hello,

To avoid computing all possible combinations, I'm trying to group values
according to a certain key, and then compute the cartesian product of the
values for each key, i.e.:

Input
 [(k1, [v1]), (k1, [v2]), (k2, [v3])]

Desired output:
[(v1, v1), (v1, v2), (v2, v2), (v2, v1), (v3, v3)]

Currently I'm doing it as follows (product is from Python itertools):
input = sc.textFile('data.csv')
rdd = input.map(lambda x: (x.key, [x]))
rdd2 = rdd.reduceByKey(lambda x, y: x + y)
rdd3 = rdd2.flatMapValues(lambda x: itertools.product(x, x))
result = rdd3.map(lambda x: x[1])

This works fine for very small files, but when the list is of length ~1000
the computation completely freezes.

Thanks in advance!
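
For reference, the per-key cartesian product can also be written as a self-join
on the key, which avoids materialising one potentially huge list per key. A
minimal Scala sketch (the CSV column positions used for the key and value are
assumptions for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("per-key-cartesian"))

// (key, value) pairs; how the key is parsed out of each line is an assumption
val pairs: RDD[(String, String)] = sc.textFile("data.csv").map { line =>
  val cols = line.split(",")
  (cols(0), cols(1))
}

// join the RDD with itself on the key: this emits every (v1, v2) combination
// that shares a key, without building the full list of values per key
val result: RDD[(String, String)] = pairs.join(pairs).values
result.take(10).foreach(println)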




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Optimizing-cartesian-product-using-keys-tp26361.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Flattening Data within DataFrames

2016-02-29 Thread Kevin Mellott
Fellow Sparkers,

I'm trying to "flatten" my view of data within a DataFrame, and am having
difficulties doing so. The DataFrame contains product information, which
includes multiple levels of categories (primary, secondary, etc).

*Example Data (Raw):*
*Name          Category    Level*
Baked Code    Food        1
Baked Code    Seafood     2
Baked Code    Fish        3
Hockey Stick  Sports      1
Hockey Stick  Hockey      2
Hockey Stick  Equipment   3

*Desired Data:*
*Name          Category1   Category2   Category3*
Baked Code    Food        Seafood     Fish
Hockey Stick  Sports      Hockey      Equipment

*Approach:*
After parsing the "raw" information into two separate DataFrames (called
*products* and *categories*) and registering them as Spark SQL tables, I was
attempting to perform the following query to flatten this all into the
"desired data" (depicted above).

products.registerTempTable("products")
categories.registerTempTable("categories")

val productList = sqlContext.sql(
  " SELECT p.Name, " +
  " c1.Description AS Category1, " +
  " c2.Description AS Category2, " +
  " c3.Description AS Category3 " +
  " FROM products AS p " +
  "   JOIN categories AS c1 " +
  " ON c1.Name = p.Name AND c1.Level = '1' " +
  "   JOIN categories AS c2 " +
  " ON c2.Name = p.Name AND c2.Level = '2' " +
  "   JOIN categories AS c3 " +
  " ON c3.Name = p.Name AND c3.Level = '3' ")

*Issue:*
I get an error when running my query above, because I am not able to JOIN
the *categories* table more than once. Has anybody dealt with this type of
use case before, and if so how did you achieve the desired behavior?

Thank you in advance for your thoughts.

Kevin
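
For comparison, the same flattening can be expressed with the DataFrame API by
filtering *categories* once per level and joining on Name. A minimal Scala
sketch, assuming the column names used in the query above (Name, Description,
Level):

import org.apache.spark.sql.functions.col

// one small DataFrame per category level, each with a distinct column name
val c1 = categories.filter(col("Level") === "1")
  .select(col("Name"), col("Description").as("Category1"))
val c2 = categories.filter(col("Level") === "2")
  .select(col("Name"), col("Description").as("Category2"))
val c3 = categories.filter(col("Level") === "3")
  .select(col("Name"), col("Description").as("Category3"))

// join them back onto products, using "Name" as the join column each time
val productList = products
  .join(c1, "Name")
  .join(c2, "Name")
  .join(c3, "Name")
  .select("Name", "Category1", "Category2", "Category3")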


Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread diplomatic Guru
Thank you very much Kevin.



On 29 February 2016 at 16:20, Kevin Mellott 
wrote:

> I found a helper class that I think should do the trick. Take a look at
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/loss/Losses.scala
>
> When passing the Loss, you should be able to do something like:
>
> Losses.fromString("leastSquaresError")
>
> On Mon, Feb 29, 2016 at 10:03 AM, diplomatic Guru <
> diplomaticg...@gmail.com> wrote:
>
>> It's strange as you are correct the doc does state it. But it's
>> complaining about the constructor.
>>
>> When I clicked on the org.apache.spark.mllib.tree.loss.AbsoluteError
>> class, this is what I see:
>>
>>
>> @Since("1.2.0")
>> @DeveloperApi
>> object AbsoluteError extends Loss {
>>
>>   /**
>>* Method to calculate the gradients for the gradient boosting
>> calculation for least
>>* absolute error calculation.
>>* The gradient with respect to F(x) is: sign(F(x) - y)
>>* @param prediction Predicted label.
>>* @param label True label.
>>* @return Loss gradient
>>*/
>>   @Since("1.2.0")
>>   override def gradient(prediction: Double, label: Double): Double = {
>> if (label - prediction < 0) 1.0 else -1.0
>>   }
>>
>>   override private[mllib] def computeError(prediction: Double, label:
>> Double): Double = {
>> val err = label - prediction
>> math.abs(err)
>>   }
>> }
>>
>>
>> On 29 February 2016 at 15:49, Kevin Mellott 
>> wrote:
>>
>>> Looks like it should be present in 1.3 at
>>> org.apache.spark.mllib.tree.loss.AbsoluteError
>>>
>>>
>>> spark.apache.org/docs/1.3.0/api/java/org/apache/spark/mllib/tree/loss/AbsoluteError.html
>>>
>>> On Mon, Feb 29, 2016 at 9:46 AM, diplomatic Guru <
>>> diplomaticg...@gmail.com> wrote:
>>>
 AbsoluteError() constructor is undefined.

 I'm using Spark 1.3.0, maybe it is not ready for this version?



 On 29 February 2016 at 15:38, Kevin Mellott 
 wrote:

> I believe that you can instantiate an instance of the AbsoluteError
> class for the *Loss* object, since that object implements the Loss
> interface. For example.
>
> val loss = new AbsoluteError()
> boostingStrategy.setLoss(loss)
>
> On Mon, Feb 29, 2016 at 9:33 AM, diplomatic Guru <
> diplomaticg...@gmail.com> wrote:
>
>> Hi Kevin,
>>
>> Yes, I've set the bootingStrategy like that using the example. But
>> I'm not sure how to create and pass the Loss object.
>>
>> e.g
>>
>> boostingStrategy.setLoss(..);
>>
>> Not sure how to pass the selected Loss.
>>
>> How do I set the  Absolute Error in setLoss() function?
>>
>>
>>
>>
>> On 29 February 2016 at 15:26, Kevin Mellott <
>> kevin.r.mell...@gmail.com> wrote:
>>
>>> You can use the constructor that accepts a BoostingStrategy object,
>>> which will allow you to set the tree strategy (and other 
>>> hyperparameters as
>>> well).
>>>
>>> *GradientBoostedTrees
>>> *
>>> (BoostingStrategy
>>> 
>>>  boostingStrategy)
>>>
>>> On Mon, Feb 29, 2016 at 9:21 AM, diplomatic Guru <
>>> diplomaticg...@gmail.com> wrote:
>>>
 Hello guys,

 I think the default Loss algorithm is Squared Error for regression,
 but how do I change that to Absolute Error in Java.

 Could you please show me an example?



>>>
>>
>

>>>
>>
>


Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread Kevin Mellott
I found a helper class that I think should do the trick. Take a look at
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/loss/Losses.scala

When passing the Loss, you should be able to do something like:

Losses.fromString("leastSquaresError")
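
For the absolute-error case specifically, a sketch of how the pieces might fit
together (shown in Scala; the setLoss setter used in the thread is the same).
The loss name "leastAbsoluteError" and the assumed trainingData RDD of
LabeledPoint are my additions, not taken from the thread:

import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.loss.Losses

// start from the default regression parameters, then swap in the loss by name
val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.setLoss(Losses.fromString("leastAbsoluteError"))

// trainingData is assumed to be an existing RDD[LabeledPoint]
val model = GradientBoostedTrees.train(trainingData, boostingStrategy)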

On Mon, Feb 29, 2016 at 10:03 AM, diplomatic Guru 
wrote:

> It's strange as you are correct the doc does state it. But it's
> complaining about the constructor.
>
> When I clicked on the org.apache.spark.mllib.tree.loss.AbsoluteError
> class, this is what I see:
>
>
> @Since("1.2.0")
> @DeveloperApi
> object AbsoluteError extends Loss {
>
>   /**
>* Method to calculate the gradients for the gradient boosting
> calculation for least
>* absolute error calculation.
>* The gradient with respect to F(x) is: sign(F(x) - y)
>* @param prediction Predicted label.
>* @param label True label.
>* @return Loss gradient
>*/
>   @Since("1.2.0")
>   override def gradient(prediction: Double, label: Double): Double = {
> if (label - prediction < 0) 1.0 else -1.0
>   }
>
>   override private[mllib] def computeError(prediction: Double, label:
> Double): Double = {
> val err = label - prediction
> math.abs(err)
>   }
> }
>
>
> On 29 February 2016 at 15:49, Kevin Mellott 
> wrote:
>
>> Looks like it should be present in 1.3 at
>> org.apache.spark.mllib.tree.loss.AbsoluteError
>>
>>
>> spark.apache.org/docs/1.3.0/api/java/org/apache/spark/mllib/tree/loss/AbsoluteError.html
>>
>> On Mon, Feb 29, 2016 at 9:46 AM, diplomatic Guru <
>> diplomaticg...@gmail.com> wrote:
>>
>>> AbsoluteError() constructor is undefined.
>>>
>>> I'm using Spark 1.3.0, maybe it is not ready for this version?
>>>
>>>
>>>
>>> On 29 February 2016 at 15:38, Kevin Mellott 
>>> wrote:
>>>
 I believe that you can instantiate an instance of the AbsoluteError
 class for the *Loss* object, since that object implements the Loss
 interface. For example.

 val loss = new AbsoluteError()
 boostingStrategy.setLoss(loss)

 On Mon, Feb 29, 2016 at 9:33 AM, diplomatic Guru <
 diplomaticg...@gmail.com> wrote:

> Hi Kevin,
>
> Yes, I've set the bootingStrategy like that using the example. But I'm
> not sure how to create and pass the Loss object.
>
> e.g
>
> boostingStrategy.setLoss(..);
>
> Not sure how to pass the selected Loss.
>
> How do I set the  Absolute Error in setLoss() function?
>
>
>
>
> On 29 February 2016 at 15:26, Kevin Mellott  > wrote:
>
>> You can use the constructor that accepts a BoostingStrategy object,
>> which will allow you to set the tree strategy (and other hyperparameters 
>> as
>> well).
>>
>> *GradientBoostedTrees
>> *
>> (BoostingStrategy
>> 
>>  boostingStrategy)
>>
>> On Mon, Feb 29, 2016 at 9:21 AM, diplomatic Guru <
>> diplomaticg...@gmail.com> wrote:
>>
>>> Hello guys,
>>>
>>> I think the default Loss algorithm is Squared Error for regression,
>>> but how do I change that to Absolute Error in Java.
>>>
>>> Could you please show me an example?
>>>
>>>
>>>
>>
>

>>>
>>
>


Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread diplomatic Guru
It's strange as you are correct the doc does state it. But it's complaining
about the constructor.

When I clicked on the org.apache.spark.mllib.tree.loss.AbsoluteError class,
this is what I see:


@Since("1.2.0")
@DeveloperApi
object AbsoluteError extends Loss {

  /**
   * Method to calculate the gradients for the gradient boosting
calculation for least
   * absolute error calculation.
   * The gradient with respect to F(x) is: sign(F(x) - y)
   * @param prediction Predicted label.
   * @param label True label.
   * @return Loss gradient
   */
  @Since("1.2.0")
  override def gradient(prediction: Double, label: Double): Double = {
if (label - prediction < 0) 1.0 else -1.0
  }

  override private[mllib] def computeError(prediction: Double, label:
Double): Double = {
val err = label - prediction
math.abs(err)
  }
}


On 29 February 2016 at 15:49, Kevin Mellott 
wrote:

> Looks like it should be present in 1.3 at
> org.apache.spark.mllib.tree.loss.AbsoluteError
>
>
> spark.apache.org/docs/1.3.0/api/java/org/apache/spark/mllib/tree/loss/AbsoluteError.html
>
> On Mon, Feb 29, 2016 at 9:46 AM, diplomatic Guru  > wrote:
>
>> AbsoluteError() constructor is undefined.
>>
>> I'm using Spark 1.3.0, maybe it is not ready for this version?
>>
>>
>>
>> On 29 February 2016 at 15:38, Kevin Mellott 
>> wrote:
>>
>>> I believe that you can instantiate an instance of the AbsoluteError
>>> class for the *Loss* object, since that object implements the Loss
>>> interface. For example.
>>>
>>> val loss = new AbsoluteError()
>>> boostingStrategy.setLoss(loss)
>>>
>>> On Mon, Feb 29, 2016 at 9:33 AM, diplomatic Guru <
>>> diplomaticg...@gmail.com> wrote:
>>>
 Hi Kevin,

 Yes, I've set the bootingStrategy like that using the example. But I'm
 not sure how to create and pass the Loss object.

 e.g

 boostingStrategy.setLoss(..);

 Not sure how to pass the selected Loss.

 How do I set the  Absolute Error in setLoss() function?




 On 29 February 2016 at 15:26, Kevin Mellott 
 wrote:

> You can use the constructor that accepts a BoostingStrategy object,
> which will allow you to set the tree strategy (and other hyperparameters 
> as
> well).
>
> *GradientBoostedTrees
> *
> (BoostingStrategy
> 
>  boostingStrategy)
>
> On Mon, Feb 29, 2016 at 9:21 AM, diplomatic Guru <
> diplomaticg...@gmail.com> wrote:
>
>> Hello guys,
>>
>> I think the default Loss algorithm is Squared Error for regression,
>> but how do I change that to Absolute Error in Java.
>>
>> Could you please show me an example?
>>
>>
>>
>

>>>
>>
>


Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread Kevin Mellott
Looks like it should be present in 1.3 at
org.apache.spark.mllib.tree.loss.AbsoluteError

spark.apache.org/docs/1.3.0/api/java/org/apache/spark/mllib/tree/loss/AbsoluteError.html

On Mon, Feb 29, 2016 at 9:46 AM, diplomatic Guru 
wrote:

> AbsoluteError() constructor is undefined.
>
> I'm using Spark 1.3.0, maybe it is not ready for this version?
>
>
>
> On 29 February 2016 at 15:38, Kevin Mellott 
> wrote:
>
>> I believe that you can instantiate an instance of the AbsoluteError class
>> for the *Loss* object, since that object implements the Loss interface.
>> For example.
>>
>> val loss = new AbsoluteError()
>> boostingStrategy.setLoss(loss)
>>
>> On Mon, Feb 29, 2016 at 9:33 AM, diplomatic Guru <
>> diplomaticg...@gmail.com> wrote:
>>
>>> Hi Kevin,
>>>
>>> Yes, I've set the bootingStrategy like that using the example. But I'm
>>> not sure how to create and pass the Loss object.
>>>
>>> e.g
>>>
>>> boostingStrategy.setLoss(..);
>>>
>>> Not sure how to pass the selected Loss.
>>>
>>> How do I set the  Absolute Error in setLoss() function?
>>>
>>>
>>>
>>>
>>> On 29 February 2016 at 15:26, Kevin Mellott 
>>> wrote:
>>>
 You can use the constructor that accepts a BoostingStrategy object,
 which will allow you to set the tree strategy (and other hyperparameters as
 well).

 *GradientBoostedTrees
 *
 (BoostingStrategy
 
  boostingStrategy)

 On Mon, Feb 29, 2016 at 9:21 AM, diplomatic Guru <
 diplomaticg...@gmail.com> wrote:

> Hello guys,
>
> I think the default Loss algorithm is Squared Error for regression,
> but how do I change that to Absolute Error in Java.
>
> Could you please show me an example?
>
>
>

>>>
>>
>


Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread diplomatic Guru
AbsoluteError() constructor is undefined.

I'm using Spark 1.3.0, maybe it is not ready for this version?



On 29 February 2016 at 15:38, Kevin Mellott 
wrote:

> I believe that you can instantiate an instance of the AbsoluteError class
> for the *Loss* object, since that object implements the Loss interface.
> For example.
>
> val loss = new AbsoluteError()
> boostingStrategy.setLoss(loss)
>
> On Mon, Feb 29, 2016 at 9:33 AM, diplomatic Guru  > wrote:
>
>> Hi Kevin,
>>
>> Yes, I've set the bootingStrategy like that using the example. But I'm
>> not sure how to create and pass the Loss object.
>>
>> e.g
>>
>> boostingStrategy.setLoss(..);
>>
>> Not sure how to pass the selected Loss.
>>
>> How do I set the  Absolute Error in setLoss() function?
>>
>>
>>
>>
>> On 29 February 2016 at 15:26, Kevin Mellott 
>> wrote:
>>
>>> You can use the constructor that accepts a BoostingStrategy object,
>>> which will allow you to set the tree strategy (and other hyperparameters as
>>> well).
>>>
>>> *GradientBoostedTrees
>>> *
>>> (BoostingStrategy
>>> 
>>>  boostingStrategy)
>>>
>>> On Mon, Feb 29, 2016 at 9:21 AM, diplomatic Guru <
>>> diplomaticg...@gmail.com> wrote:
>>>
 Hello guys,

 I think the default Loss algorithm is Squared Error for regression, but
 how do I change that to Absolute Error in Java.

 Could you please show me an example?



>>>
>>
>


Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread Kevin Mellott
I believe that you can instantiate an instance of the AbsoluteError class
for the *Loss* object, since that object implements the Loss interface. For
example.

val loss = new AbsoluteError()
boostingStrategy.setLoss(loss)

On Mon, Feb 29, 2016 at 9:33 AM, diplomatic Guru 
wrote:

> Hi Kevin,
>
> Yes, I've set the bootingStrategy like that using the example. But I'm not
> sure how to create and pass the Loss object.
>
> e.g
>
> boostingStrategy.setLoss(..);
>
> Not sure how to pass the selected Loss.
>
> How do I set the  Absolute Error in setLoss() function?
>
>
>
>
> On 29 February 2016 at 15:26, Kevin Mellott 
> wrote:
>
>> You can use the constructor that accepts a BoostingStrategy object, which
>> will allow you to set the tree strategy (and other hyperparameters as well).
>>
>> *GradientBoostedTrees
>> *
>> (BoostingStrategy
>> 
>>  boostingStrategy)
>>
>> On Mon, Feb 29, 2016 at 9:21 AM, diplomatic Guru <
>> diplomaticg...@gmail.com> wrote:
>>
>>> Hello guys,
>>>
>>> I think the default Loss algorithm is Squared Error for regression, but
>>> how do I change that to Absolute Error in Java.
>>>
>>> Could you please show me an example?
>>>
>>>
>>>
>>
>


Re: Spark 1.5 on Mesos

2016-02-29 Thread Ashish Soni
Yes, I read that, but there are not many details there.

Is it true that we need to have Spark installed on each Mesos docker
container (master and slave) ...

Ashish

On Fri, Feb 26, 2016 at 2:14 PM, Tim Chen  wrote:

> https://spark.apache.org/docs/latest/running-on-mesos.html should be the
> best source, what problems were you running into?
>
> Tim
>
> On Fri, Feb 26, 2016 at 11:06 AM, Yin Yang  wrote:
>
>> Have you read this ?
>> https://spark.apache.org/docs/latest/running-on-mesos.html
>>
>> On Fri, Feb 26, 2016 at 11:03 AM, Ashish Soni 
>> wrote:
>>
>>> Hi All ,
>>>
>>> Is there any proper documentation as how to run spark on mesos , I am
>>> trying from the last few days and not able to make it work.
>>>
>>> Please help
>>>
>>> Ashish
>>>
>>
>>
>


Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread diplomatic Guru
Hi Kevin,

Yes, I've set the boostingStrategy like that using the example. But I'm not
sure how to create and pass the Loss object.

e.g

boostingStrategy.setLoss(..);

Not sure how to pass the selected Loss.

How do I set the  Absolute Error in setLoss() function?




On 29 February 2016 at 15:26, Kevin Mellott 
wrote:

> You can use the constructor that accepts a BoostingStrategy object, which
> will allow you to set the tree strategy (and other hyperparameters as well).
>
> *GradientBoostedTrees
> *
> (BoostingStrategy
> 
>  boostingStrategy)
>
> On Mon, Feb 29, 2016 at 9:21 AM, diplomatic Guru  > wrote:
>
>> Hello guys,
>>
>> I think the default Loss algorithm is Squared Error for regression, but
>> how do I change that to Absolute Error in Java.
>>
>> Could you please show me an example?
>>
>>
>>
>


Re: kafka + mysql filtering problem

2016-02-29 Thread Cody Koeninger
You're getting confused about what code is running on the driver vs what
code is running on the executor.  Read

http://spark.apache.org/docs/latest/programming-guide.html#understanding-closures-a-nameclosureslinka
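
To make the driver/executor split concrete, here is a minimal sketch of one way
to keep the per-record work inside the DStream transformation and ship the
rules to the executors with a broadcast variable. The column positions, the CID
lookup and the applyRule helper are illustrative assumptions, not the poster's
code:

// load the (small) rules table once on the driver and broadcast it
val ruleMap: Map[Int, String] = rules.collect()
  .map(row => row.getInt(0) -> row.getString(1)).toMap  // (CID, FILTER_DSC) assumed
val rulesBc = ssc.sparkContext.broadcast(ruleMap)

// hypothetical stand-in for the real rule evaluation
def applyRule(rule: String, value: String): Boolean =
  rule.nonEmpty && value.nonEmpty

val decisions = lines.map { line =>
  val fields = line.split(",")
  // this closure runs on the executors: read the broadcast value here instead
  // of trying to mutate a driver-side var such as mto
  if (fields.length > 5) applyRule(rulesBc.value(1), fields(5)).toString
  else "false"
}
decisions.print()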



On Mon, Feb 29, 2016 at 8:00 AM, franco barrientos <
franco.barrien...@exalitica.com> wrote:

> Hi all,
>
> I want to read some filtering rules from mysql (jdbc mysql driver)
> specifically its a char type containing a field and value to process in a
> kafka streaming input.
>
> The main idea is to process this from a web UI (livy server).
>
> Any suggestion or guidelines?
>
> e.g., I have this:
>
> *object Streaming {*
> *  def main(args: Array[String]) {*
> *if (args.length < 4) {*
> *  System.err.println("Usage: KafkaWordCount  
>  ")*
> *  System.exit(1)*
> *}*
> * val Array(zkQuorum, group, topics, numThreads) = args*
> * var spc = SparkContext.getOrCreate()*
> * val ssc = new StreamingContext(spc, Seconds(3))*
> *val lines = KafkaUtils.createStream(ssc, zkQuorum, group,
> Map(topics -> 5)).map(_._2)*
> * /* TEST MYSQL */*
> * val sqlContext = new SQLContext(spc)*
> * val prop = new java.util.Properties*
> * val url = "jdbc:mysql://52.22.38.81:3306/tmp
> "*
> * val tbl_users = "santander_demo_users"*
> * val tbl_rules = "santander_demo_filters"*
> * val tbl_campaigns = "santander_demo_campaigns"*
> * prop.setProperty("user", "root")*
> * prop.setProperty("password", "Exalitica2014")*
> * val users = sqlContext.read.jdbc(url, tbl_users, prop)*
> * val rules = sqlContext.read.jdbc(url, tbl_rules, prop)*
> * val campaigns = sqlContext.read.jdbc(url, tbl_campaigns, prop)*
> * val toolbox = currentMirror.mkToolBox()*
> * val toRemove = "\"".toSet*
> *var mto = "0"*
>
> * def rule_apply (n:Int, t:String, rules:DataFrame) : String = {*
> * // reading rules from mysql*
> *  var r = (rules.filter(rules("CID") ===
> n).select("FILTER_DSC").first())(0).toString()*
>
> *  // using mkToolbox for pre-processing rules*
> *   return toolbox.eval(toolbox.parse("""*
> * val mto = """ + t + """*
> * if(""" + r + """) {*
> *   return "true"*
> * } else {*
> *   return "false"*
> *}*
> *   """)).toString()*
> * }*
> * /* TEST MYSQL */*
>
> * lines.map{x =>*
> *  if(x.split(",").length > 1) {*
> *// reading from kafka input*
> *mto = spc.broadcast(x.split(",")(5).filterNot(toRemove))*
> *  }*
> * }*
> * var msg = rule_apply(1, mto, rules)*
> * var word = lines.map(x => msg)*
> * word.print()*
> *ssc.start()*
> *ssc.awaitTermination()*
> *  }*
> *}*
>
> The problem is that *mto* variable always returns to “0” value after
> mapping lines DStream. I tried to process *rule_apply *into map but I get
> not serializable mkToolbox class error.
>
> Thanks in advance.
>
> *Franco Barrientos*
> Data Scientist
>
> Málaga #115, Of. 1003, Las Condes.
> Santiago, Chile.
> (+562)-29699649
> (+569)-76347893
>
> franco.barrien...@exalitica.com
>
> www.exalitica.com
>
> [image: http://exalitica.com/web/img/frim.png]
>


Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread Kevin Mellott
You can use the constructor that accepts a BoostingStrategy object, which
will allow you to set the tree strategy (and other hyperparameters as well).

*GradientBoostedTrees
*
(BoostingStrategy

 boostingStrategy)

On Mon, Feb 29, 2016 at 9:21 AM, diplomatic Guru 
wrote:

> Hello guys,
>
> I think the default Loss algorithm is Squared Error for regression, but
> how do I change that to Absolute Error in Java.
>
> Could you please show me an example?
>
>
>


[MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread diplomatic Guru
Hello guys,

I think the default Loss algorithm is Squared Error for regression, but how
do I change that to Absolute Error in Java.

Could you please show me an example?


Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-02-29 Thread Yin Yang
The default value for spark.shuffle.reduceLocality.enabled is true.

To reduce surprise to users of 1.5 and earlier releases, should the default
value be set to false ?

On Mon, Feb 29, 2016 at 5:38 AM, Lior Chaga  wrote:

> Hi Koret,
> Try spark.shuffle.reduceLocality.enabled=false
> This is an undocumented configuration.
> See:
> https://github.com/apache/spark/pull/8280
> https://issues.apache.org/jira/browse/SPARK-10567
>
> It solved the problem for me (both with and without memory legacy mode)
>
>
> On Sun, Feb 28, 2016 at 11:16 PM, Koert Kuipers  wrote:
>
>> i find it particularly confusing that a new memory management module
>> would change the locations. its not like the hash partitioner got replaced.
>> i can switch back and forth between legacy and "new" memory management and
>> see the distribution change... fully reproducible
>>
>> On Sun, Feb 28, 2016 at 11:24 AM, Lior Chaga  wrote:
>>
>>> Hi,
>>> I've experienced a similar problem upgrading from spark 1.4 to spark 1.6.
>>> The data is not evenly distributed across executors, but in my case it
>>> also reproduced with legacy mode.
>>> Also tried 1.6.1 rc-1, with same results.
>>>
>>> Still looking for resolution.
>>>
>>> Lior
>>>
>>> On Fri, Feb 19, 2016 at 2:01 AM, Koert Kuipers 
>>> wrote:
>>>
 looking at the cached rdd i see a similar story:
 with useLegacyMode = true the cached rdd is spread out across 10
 executors, but with useLegacyMode = false the data for the cached rdd sits
 on only 3 executors (the rest all show 0s). my cached RDD is a key-value
 RDD that got partitioned (hash partitioner, 50 partitions) before being
 cached.

 On Thu, Feb 18, 2016 at 6:51 PM, Koert Kuipers 
 wrote:

> hello all,
> we are just testing a semi-realtime application (it should return
> results in less than 20 seconds from cached RDDs) on spark 1.6.0. before
> this it used to run on spark 1.5.1
>
> in spark 1.6.0 the performance is similar to 1.5.1 if i set
> spark.memory.useLegacyMode = true, however if i switch to
> spark.memory.useLegacyMode = false the queries take about 50% to 100% more
> time.
>
> the issue becomes clear when i focus on a single stage: the individual
> tasks are not slower at all, but they run on less executors.
> in my test query i have 50 tasks and 10 executors. both with
> useLegacyMode = true and useLegacyMode = false the tasks finish in about 3
> seconds and show as running PROCESS_LOCAL. however when  useLegacyMode =
> false the tasks run on just 3 executors out of 10, while with 
> useLegacyMode
> = true they spread out across 10 executors. all the tasks running on just 
> a
> few executors leads to the slower results.
>
> any idea why this would happen?
> thanks! koert
>
>
>

>>>
>>
>


Re: Spark on Windows platform

2016-02-29 Thread Gaurav Agarwal
> Hi
> I am running spark on windows but a standalone one.
>
> Use this code
>
> SparkConf conf = new
SparkConf().setMaster("local[1]").setAppName("spark").setSparkHome("c:/spark/bin/spark-submit.cmd");
>
> Where sparkHome is the path where you extracted your Spark binaries, up to
bin/*.cmd
>
> You will get spark context or streaming context
>
> Thanks
>
> On Feb 29, 2016 7:10 PM, "gaurav pathak" 
wrote:
>>
>> Thanks Jorn.
>>
>> Any guidance on how to get started with getting SPARK on Windows, is
highly appreciated.
>>
>> Thanks & Regards
>>
>> Gaurav Pathak
>>
>> ~ sent from handheld device
>>
>> On Feb 29, 2016 5:34 AM, "Jörn Franke"  wrote:
>>>
>>> I think Hortonworks has a Windows Spark distribution. Maybe Bigtop as
well?
>>>
>>> > On 29 Feb 2016, at 14:27, gaurav pathak 
wrote:
>>> >
>>> > Can someone guide me the steps and information regarding,
installation of SPARK on Windows 7/8.1/10 , as well as on Windows Server.
Also, it will be great to read your experiences in using SPARK on Windows
platform.
>>> >
>>> >
>>> > Thanks & Regards,
>>> > Gaurav Pathak


kafka + mysql filtering problem

2016-02-29 Thread franco barrientos
Hi all,

I want to read some filtering rules from mysql (jdbc mysql driver) specifically 
its a char type containing a field and value to process in a kafka streaming 
input.

The main idea is to process this from a web UI (livy server).

Any suggestion or guidelines?

e.g., I have this:

object Streaming {
  def main(args: Array[String]) {
if (args.length < 4) {
  System.err.println("Usage: KafkaWordCount
")
  System.exit(1)
}
val Array(zkQuorum, group, topics, numThreads) = args
var spc = SparkContext.getOrCreate()
val ssc = new StreamingContext(spc, Seconds(3))
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, Map(topics -> 
5)).map(_._2)
/* TEST MYSQL */
val sqlContext = new SQLContext(spc)
val prop = new java.util.Properties
val url = "jdbc:mysql://52.22.38.81:3306/tmp"
val tbl_users = "santander_demo_users"
val tbl_rules = "santander_demo_filters"
val tbl_campaigns = "santander_demo_campaigns"
prop.setProperty("user", "root")
prop.setProperty("password", "Exalitica2014")
val users = sqlContext.read.jdbc(url, tbl_users, prop)
val rules = sqlContext.read.jdbc(url, tbl_rules, prop)
val campaigns = sqlContext.read.jdbc(url, tbl_campaigns, prop)
val toolbox = currentMirror.mkToolBox()
val toRemove = "\"".toSet
var mto = "0"

def rule_apply (n:Int, t:String, rules:DataFrame) : String = {
 // reading rules from mysql
  var r = (rules.filter(rules("CID") === 
n).select("FILTER_DSC").first())(0).toString()
  
  // using mkToolbox for pre-processing rules
return toolbox.eval(toolbox.parse("""
  val mto = """ + t + """
  if(""" + r + """) {
return "true"
  } else {
return "false"
}
""")).toString()
}
/* TEST MYSQL */

lines.map{x =>
  if(x.split(",").length > 1) {
// reading from kafka input
mto = spc.broadcast(x.split(",")(5).filterNot(toRemove))
  }
}
var msg = rule_apply(1, mto, rules)
var word = lines.map(x => msg)
word.print()
ssc.start()
ssc.awaitTermination()
  }
}

The problem is that mto variable always returns to “0” value after mapping 
lines DStream. I tried to process rule_apply into map but I get not 
serializable mkToolbox class error.

Thanks in advance.

Franco Barrientos
Data Scientist

Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893

franco.barrien...@exalitica.com 

www.exalitica.com




Re: Spark Integration Patterns

2016-02-29 Thread moshir mikael
Thanks, will check too; however, I just want to use Spark core RDDs and
standard data sources.
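
For what it's worth, one direct pattern is to make the client application
itself the Spark driver and point it at the remote master; a minimal sketch in
Scala (host and port are placeholders, and the client machine must be network
reachable from the cluster nodes):

import org.apache.spark.{SparkConf, SparkContext}

// a sketch only: connect to a remote standalone master from a plain program
val conf = new SparkConf()
  .setAppName("remote-client")
  .setMaster("spark://master-host:7077")
val sc = new SparkContext(conf)

println(sc.parallelize(1 to 1000).sum())  // the work runs on the remote cluster
sc.stop()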


On Mon, Feb 29, 2016 at 2:54 PM, Alex Dzhagriev  wrote:

> Hi Moshir,
>
> Regarding the streaming, you can take a look at the spark streaming, the
> micro-batching framework. If it satisfies your needs it has a bunch of
> integrations. Thus, the source for the jobs could be Kafka, Flume or Akka.
>
> Cheers, Alex.
>
> On Mon, Feb 29, 2016 at 2:48 PM, moshir mikael 
> wrote:
>
>> Hi Alex,
>> thanks for the link. Will check it.
>> Does someone know of a more streamlined approach ?
>>
>>
>>
>>
>> On Mon, Feb 29, 2016 at 10:28 AM, Alex Dzhagriev  wrote:
>>
>>> Hi Moshir,
>>>
>>> I think you can use the rest api provided with Spark:
>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala
>>>
>>> Unfortunately, I haven't found any documentation, but it looks fine.
>>> Thanks, Alex.
>>>
>>> On Sun, Feb 28, 2016 at 3:25 PM, mms  wrote:
>>>
 Hi, I cannot find a simple example showing how a typical application
 can 'connect' to a remote spark cluster and interact with it. Let's say I
 have a Python web application hosted somewhere *outside *a spark
 cluster, with just python installed on it. How can I talk to Spark without
 using a notebook, or using ssh to connect to a cluster master node ? I know
 of spark-submit and spark-shell, however forking a process on a remote host
 to execute a shell script seems like a lot of effort What are the
 recommended ways to connect and query Spark from a remote client ? Thanks
 Thx !
 --
 View this message in context: Spark Integration Patterns
 
 Sent from the Apache Spark User List mailing list archive
  at Nabble.com.

>>>
>>>
>


Re: Spark Integration Patterns

2016-02-29 Thread Alex Dzhagriev
Hi Moshir,

Regarding the streaming, you can take a look at the spark streaming, the
micro-batching framework. If it satisfies your needs it has a bunch of
integrations. Thus, the source for the jobs could be Kafka, Flume or Akka.

Cheers, Alex.

On Mon, Feb 29, 2016 at 2:48 PM, moshir mikael 
wrote:

> Hi Alex,
> thanks for the link. Will check it.
> Does someone know of a more streamlined approach ?
>
>
>
>
> On Mon, Feb 29, 2016 at 10:28 AM, Alex Dzhagriev  wrote:
>
>> Hi Moshir,
>>
>> I think you can use the rest api provided with Spark:
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala
>>
>> Unfortunately, I haven't found any documentation, but it looks fine.
>> Thanks, Alex.
>>
>> On Sun, Feb 28, 2016 at 3:25 PM, mms  wrote:
>>
>>> Hi, I cannot find a simple example showing how a typical application can
>>> 'connect' to a remote spark cluster and interact with it. Let's say I have
>>> a Python web application hosted somewhere *outside *a spark cluster,
>>> with just python installed on it. How can I talk to Spark without using a
>>> notebook, or using ssh to connect to a cluster master node ? I know of
>>> spark-submit and spark-shell, however forking a process on a remote host to
>>> execute a shell script seems like a lot of effort What are the recommended
>>> ways to connect and query Spark from a remote client ? Thanks Thx !
>>> --
>>> View this message in context: Spark Integration Patterns
>>> 
>>> Sent from the Apache Spark User List mailing list archive
>>>  at Nabble.com.
>>>
>>
>>


Re: Spark Integration Patterns

2016-02-29 Thread moshir mikael
Hi Alex,
thanks for the link. Will check it.
Does someone know of a more streamlined approach ?




On Mon, Feb 29, 2016 at 10:28 AM, Alex Dzhagriev  wrote:

> Hi Moshir,
>
> I think you can use the rest api provided with Spark:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala
>
> Unfortunately, I haven't found any documentation, but it looks fine.
> Thanks, Alex.
>
> On Sun, Feb 28, 2016 at 3:25 PM, mms  wrote:
>
>> Hi, I cannot find a simple example showing how a typical application can
>> 'connect' to a remote spark cluster and interact with it. Let's say I have
>> a Python web application hosted somewhere *outside *a spark cluster,
>> with just python installed on it. How can I talk to Spark without using a
>> notebook, or using ssh to connect to a cluster master node ? I know of
>> spark-submit and spark-shell, however forking a process on a remote host to
>> execute a shell script seems like a lot of effort What are the recommended
>> ways to connect and query Spark from a remote client ? Thanks Thx !
>> --
>> View this message in context: Spark Integration Patterns
>> 
>> Sent from the Apache Spark User List mailing list archive
>>  at Nabble.com.
>>
>
>


Re: Spark on Windows platform

2016-02-29 Thread gaurav pathak
Thanks Jorn.

Any guidance on how to get started with getting SPARK on Windows, is highly
appreciated.

Thanks & Regards

Gaurav Pathak

~ sent from handheld device
On Feb 29, 2016 5:34 AM, "Jörn Franke"  wrote:

> I think Hortonworks has a Windows Spark distribution. Maybe Bigtop as well?
>
> > On 29 Feb 2016, at 14:27, gaurav pathak 
> wrote:
> >
> > Can someone guide me the steps and information regarding, installation
> of SPARK on Windows 7/8.1/10 , as well as on Windows Server. Also, it will
> be great to read your experiences in using SPARK on Windows platform.
> >
> >
> > Thanks & Regards,
> > Gaurav Pathak
>


Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-02-29 Thread Lior Chaga
Hi Koret,
Try spark.shuffle.reduceLocality.enabled=false
This is an undocumented configuration.
See:
https://github.com/apache/spark/pull/8280
https://issues.apache.org/jira/browse/SPARK-10567

It solved the problem for me (both with and without memory legacy mode)
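
For reference, a minimal sketch of setting that flag programmatically (it can
equally be passed to spark-submit as
--conf spark.shuffle.reduceLocality.enabled=false):

import org.apache.spark.{SparkConf, SparkContext}

// a sketch only: turn off reduce-task locality preferences for shuffles
val conf = new SparkConf()
  .setAppName("reduce-locality-off")
  .set("spark.shuffle.reduceLocality.enabled", "false")
val sc = new SparkContext(conf)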


On Sun, Feb 28, 2016 at 11:16 PM, Koert Kuipers  wrote:

> i find it particularly confusing that a new memory management module would
> change the locations. its not like the hash partitioner got replaced. i can
> switch back and forth between legacy and "new" memory management and see
> the distribution change... fully reproducible
>
> On Sun, Feb 28, 2016 at 11:24 AM, Lior Chaga  wrote:
>
>> Hi,
>> I've experienced a similar problem upgrading from spark 1.4 to spark 1.6.
>> The data is not evenly distributed across executors, but in my case it
>> also reproduced with legacy mode.
>> Also tried 1.6.1 rc-1, with same results.
>>
>> Still looking for resolution.
>>
>> Lior
>>
>> On Fri, Feb 19, 2016 at 2:01 AM, Koert Kuipers  wrote:
>>
>>> looking at the cached rdd i see a similar story:
>>> with useLegacyMode = true the cached rdd is spread out across 10
>>> executors, but with useLegacyMode = false the data for the cached rdd sits
>>> on only 3 executors (the rest all show 0s). my cached RDD is a key-value
>>> RDD that got partitioned (hash partitioner, 50 partitions) before being
>>> cached.
>>>
>>> On Thu, Feb 18, 2016 at 6:51 PM, Koert Kuipers 
>>> wrote:
>>>
 hello all,
 we are just testing a semi-realtime application (it should return
 results in less than 20 seconds from cached RDDs) on spark 1.6.0. before
 this it used to run on spark 1.5.1

 in spark 1.6.0 the performance is similar to 1.5.1 if i set
 spark.memory.useLegacyMode = true, however if i switch to
 spark.memory.useLegacyMode = false the queries take about 50% to 100% more
 time.

 the issue becomes clear when i focus on a single stage: the individual
 tasks are not slower at all, but they run on less executors.
 in my test query i have 50 tasks and 10 executors. both with
 useLegacyMode = true and useLegacyMode = false the tasks finish in about 3
 seconds and show as running PROCESS_LOCAL. however when  useLegacyMode =
 false the tasks run on just 3 executors out of 10, while with useLegacyMode
 = true they spread out across 10 executors. all the tasks running on just a
 few executors leads to the slower results.

 any idea why this would happen?
 thanks! koert



>>>
>>
>


Re: Spark on Windows platform

2016-02-29 Thread Jörn Franke
I think Hortonworks has a Windows Spark distribution. Maybe Bigtop as well? 

> On 29 Feb 2016, at 14:27, gaurav pathak  wrote:
> 
> Can someone guide me the steps and information regarding, installation of 
> SPARK on Windows 7/8.1/10 , as well as on Windows Server. Also, it will be 
> great to read your experiences in using SPARK on Windows platform.
> 
> 
> Thanks & Regards,
> Gaurav Pathak




Re: [Error]: Spark 1.5.2 + HiveHbase Integration

2016-02-29 Thread Ted Yu
Divya:
Please try not to cross post your question. 

In your case the hbase-common jar is needed. To find all the hbase jars needed, you 
can run 'mvn dependency:tree' and check its output. 

> On Feb 29, 2016, at 1:48 AM, Divya Gehlot  wrote:
> 
> Hi,
> I am trying to access hive table which been created using HbaseIntegration 
> I am able to access data in Hive CLI 
> But when I am trying to access the table using hivecontext of Spark
> getting following error 
>> java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/util/Bytes
>> at 
>> org.apache.hadoop.hive.hbase.HBaseSerDe.parseColumnsMapping(HBaseSerDe.java:184)
>> at 
>> org.apache.hadoop.hive.hbase.HBaseSerDeParameters.(HBaseSerDeParameters.java:73)
>> at 
>> org.apache.hadoop.hive.hbase.HBaseSerDe.initialize(HBaseSerDe.java:117)
>> at 
>> org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
>> at 
>> org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
>> at 
>> org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
>> at 
>> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
>> at 
>> org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:258)
>> at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:605)
>> at 
>> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$3.apply(ClientWrapper.scala:330)
>> at 
>> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$3.apply(ClientWrapper.scala:325)
> 
> 
> 
> Have added following jars to Spark class path :
> /usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler.jar,
> /usr/hdp/2.3.4.0-3485/hive/lib/zookeeper-3.4.6.2.3.4.0-3485.jar,
> /usr/hdp/2.3.4.0-3485/hive/lib/guava-14.0.1.jar,
> /usr/hdp/2.3.4.0-3485/hive/lib/protobuf-java-2.5.0.jar
> 
> Which jar files  am I missing ??
> 
> 
> Thanks,
> Regards,
> Divya 


Spark on Windows platform

2016-02-29 Thread gaurav pathak
Can someone guide me the steps and information regarding, installation of
SPARK on Windows 7/8.1/10 , as well as on Windows Server. Also, it will be
great to read your experiences in using SPARK on Windows platform.


Thanks & Regards,
Gaurav Pathak


Implementation of random algorithm walk in spark

2016-02-29 Thread naveen.marri
Hi,

I'm new to spark, I'm trying to compute similarity between users/products.
I've a huge table which I can't do a self join with the cluster I have.

I'm trying to implement a self join using a random walk methodology which
will approximately give the results. The table is a bipartite graph with 2
columns

Idea:
take any element(t1) in the first column in random
picking the corresponding element(t2) in for the element(t1) in the graph.
lookup for possible elements in the graph for t2 in random say t3
create a edge between t1 and t3
Iterate it in the order of atleat n*n so that results will be approximate
Questions

Is spark a suitable environment to do this?
I've coded logic for picking elements in random but facing issue when
building graph
Should consider graphx?
Any help is highly appreciated.

Regards,
Naveen



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Implementation-of-random-algorithm-walk-in-spark-tp26360.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Implementation of random algorithm walk in spark

2016-02-29 Thread naveenkumarmarri
Hi,

I'm new to spark, I'm trying to compute similarity between users/products.
I've a huge table which I can't do a self join with the cluster I have.

I'm trying to implement a self join using a random walk methodology which
will approximately give the results. The table is a bipartite graph with 2
columns

Idea:

   - take any element(t1) in the first column in random
   - picking the corresponding element(t2) in for the element(t1) in the
   graph.
   - lookup for possible elements in the graph for t2 in random say t3
   - create a edge between t1 and t3
   - Iterate it in the order of atleat n*n so that results will be
   approximate

Questions


   - Is spark a suitable environment to do this?
   - I've coded logic for picking elements in random but facing issue when
   building graph
   - Should consider graphx?

Any help is highly appreciated.

Regards,
Naveen


Deadlock between UnifiedMemoryManager and BlockManager

2016-02-29 Thread Sea
Hi all,
My Spark version is 1.6.0 and I found a deadlock in a production environment.
Can anyone help? I created an issue in JIRA:
https://issues.apache.org/jira/browse/SPARK-13566




===
"block-manager-slave-async-thread-pool-1":
at org.apache.spark.storage.MemoryStore.remove(MemoryStore.scala:216)
- waiting to lock <0x0005895b09b0> (a 
org.apache.spark.memory.UnifiedMemoryManager)
at 
org.apache.spark.storage.BlockManager.removeBlock(BlockManager.scala:1114)
- locked <0x00058ed6aae0> (a org.apache.spark.storage.BlockInfo)
at 
org.apache.spark.storage.BlockManager$$anonfun$removeBroadcast$2.apply(BlockManager.scala:1101)
at 
org.apache.spark.storage.BlockManager$$anonfun$removeBroadcast$2.apply(BlockManager.scala:1101)
at scala.collection.immutable.Set$Set2.foreach(Set.scala:94)
at 
org.apache.spark.storage.BlockManager.removeBroadcast(BlockManager.scala:1101)
at 
org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply$mcI$sp(BlockManagerSlaveEndpoint.scala:65)
at 
org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:65)
at 
org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:65)
at 
org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$1.apply(BlockManagerSlaveEndpoint.scala:84)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
"Executor task launch worker-10":
at 
org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1032)
- waiting to lock <0x00059a0988b8> (a 
org.apache.spark.storage.BlockInfo)
at 
org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1009)
at 
org.apache.spark.storage.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:460)
at 
org.apache.spark.storage.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:449)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

spark lda runs out of disk space

2016-02-29 Thread TheGeorge1918 .
Hi guys


I was running LDA with 2000 topics on 6G of compressed data, roughly 1.2
million docs. I used 3 AWS r3.8xlarge machines as core nodes. The Spark
application crashed after 3 or 4 iterations. Ganglia indicated that the disk
space was all consumed; I believe it's the shuffle data not being cleaned up.
I tried the following to solve the problem:


1) I set the ‘checkpointInterval’ to be 5 and properly configured
’setCheckpointDir’ in spark context. It turned out it didn't help at all.

2) I cannot really use ‘spark.cleaner.ttl’ since I have a broadcasted
variable needed for the whole computation.


I observed that each iteration of variational inference created 500GB of
shuffle data. If it runs 100 iterations, it will need 50TB. So the only
option left for me is to use very big machines, which could be way too
expensive.


So, I just wonder


1) is there any other way to solve this problem?

2) what’s the typical config (type of machines) to be used for 1 million
docs with a few thousand topics?



Thanks


Best


Xuan


What is the best approach to perform concurrent updates from different jobs to a in memory dataframe registered as a temp table?

2016-02-29 Thread Roger Marin
Hi all,

I have multiple (>100) jobs running concurrently (sharing the same hive
context) that are each appending new rows to the same dataframe registered
as a temp table.

Currently I am using unionAll and registering that dataframe again as a
temp table in each job:

Given an existing dataframe registered as the temp table "test":

//Create dataframe with new rows to append
val newRows = hiveContext.createDataFrame(rows, schema)

//Retrieve existing dataframe and append the new dataframe via unionAll
val updatedDF=hiveContext.table("test").unionAll(newRows)

//uncache existing dataframe
hiveContext.uncacheTable("test")

//Register the updated DF as a temp table
updatedDF.registerTempTable("test")

//Cache the updated dataframe
hiveContext.table("test").cache

I am finding that using this approach can deplete memory very quickly since
each call to ".cache" in each of the jobs is creating a new entry in memory
for the same dataframe.

Does anyone know if there's a better solution to the above?

Thanks,
Roger


[Help]: Steps to access hive table + Spark 1.5.2 + HbaseIntegration + Hive 1.2 + Hbase 1.1

2016-02-29 Thread Divya Gehlot
Hi,

Can anybody help me by sharing the steps/examples for how to connect to a
hive table (which has been created using HbaseIntegration) through
hivecontext in Spark?
I googled but couldn't find a single example/document.

Would really appreciate the help.



Thanks,
Divya


Re: DirectFileOutputCommiter

2016-02-29 Thread Steve Loughran

> On 26 Feb 2016, at 06:24, Takeshi Yamamuro  wrote:
> 
> Hi,
> 
> Great work!
> What is the concrete performance gain of the committer on s3?
> I'd like to know.
> 
> I think there is no direct committer for files because these kinds of 
> committer has risks
> to loss data (See: SPARK-10063).
> Until this resolved, ISTM files cannot support direct commits.
> 

that's speculative output via any committer; you cannot use s3 as a speculative 
destination for spark, MR, hive, etc.

Speculative output relies on a file operation (create with overwrite==false,
file rename or directory rename) being atomic with respect to the check for
the destination existing and the operation of creating or renaming. There's
also a tendency to assume that file and directory rename operations are O(1).

S3 (and OpenStack Swift) don't offer those semantics. The check for existence
is done client-side before the operation, against a remote store whose metadata
may not be consistent anyway (i.e. it says the blob isn't there when it is, and
vice versa). With rename() and delete() being done client-side, they are
O(files * len(files)), and can fail partway through.

what the direct committer does is bypass any attempt to write then commit by 
renaming, which is the performance killer.




Unresolved dep when building project with spark 1.6

2016-02-29 Thread Hao Ren
Hi,

I am upgrading my project to spark 1.6.
It seems that the deps are broken.

Deps used in sbt

val scalaVersion = "2.10"
val sparkVersion  = "1.6.0"
val hadoopVersion = "2.7.1"

// Libraries
val scalaTest = "org.scalatest" %% "scalatest" % "2.2.4" % "test"
val sparkSql  = "org.apache.spark" %% "spark-sql" % sparkVersion
val sparkML   = "org.apache.spark" %% "spark-mllib" % sparkVersion
val hadoopAWS = "org.apache.hadoop" % "hadoop-aws" % hadoopVersion
val scopt = "com.github.scopt" %% "scopt" % "3.3.0"
val jodacvt   = "org.joda" % "joda-convert" % "1.8.1"

Sbt exception:

[warn] ::
[warn] ::  UNRESOLVED DEPENDENCIES ::
[warn] ::
[warn] :: org.fusesource.leveldbjni#leveldbjni-all;1.8:
org.fusesource.leveldbjni#leveldbjni-all;1.8!leveldbjni-all.pom(pom.original)
origin location must be absolute:
file:/home/invkrh/.m2/repository/org/fusesource/leveldbjni/leveldbjni-all/1.8/leveldbjni-all-1.8.pom
[warn] ::
[warn]
[warn] Note: Unresolved dependencies path:
[warn] org.fusesource.leveldbjni:leveldbjni-all:1.8
[warn]  +- org.apache.spark:spark-network-shuffle_2.10:1.6.0
[warn]  +- org.apache.spark:spark-core_2.10:1.6.0
[warn]  +- org.apache.spark:spark-catalyst_2.10:1.6.0
[warn]  +- org.apache.spark:spark-sql_2.10:1.6.0
(/home/invkrh/workspace/scala/confucius/botdet/build.sbt#L14)
[warn]  +- org.apache.spark:spark-mllib_2.10:1.6.0
(/home/invkrh/workspace/scala/confucius/botdet/build.sbt#L14)
[warn]  +- fr.leboncoin:botdet_2.10:0.1
sbt.ResolveException: unresolved dependency:
org.fusesource.leveldbjni#leveldbjni-all;1.8:
org.fusesource.leveldbjni#leveldbjni-all;1.8!leveldbjni-all.pom(pom.original)
origin location must be absolute:
file:/home/invkrh/.m2/repository/org/fusesource/leveldbjni/leveldbjni-all/1.8/leveldbjni-all-1.8.pom

Thank you.

-- 
Hao Ren

Data Engineer @ leboncoin

Paris, France


[Error]: Spark 1.5.2 + HiveHbase Integration

2016-02-29 Thread Divya Gehlot
Hi,
I am trying to access a hive table which has been created using HbaseIntegration.

I am able to access the data in the Hive CLI,
but when I am trying to access the table using the hivecontext of Spark
I am getting the following error:

> java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/util/Bytes
> at
> org.apache.hadoop.hive.hbase.HBaseSerDe.parseColumnsMapping(HBaseSerDe.java:184)
> at
> org.apache.hadoop.hive.hbase.HBaseSerDeParameters.(HBaseSerDeParameters.java:73)
> at
> org.apache.hadoop.hive.hbase.HBaseSerDe.initialize(HBaseSerDe.java:117)
> at
> org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
> at
> org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
> at
> org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
> at
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
> at
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:258)
> at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:605)
> at
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$3.apply(ClientWrapper.scala:330)
> at
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$3.apply(ClientWrapper.scala:325)



Have added following jars to Spark class path :
/usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler.jar,
/usr/hdp/2.3.4.0-3485/hive/lib/zookeeper-3.4.6.2.3.4.0-3485.jar,
/usr/hdp/2.3.4.0-3485/hive/lib/guava-14.0.1.jar,
/usr/hdp/2.3.4.0-3485/hive/lib/protobuf-java-2.5.0.jar

Which jar files am I missing?


Thanks,
Regards,
Divya


Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-29 Thread charles li
Since Spark is under active development, any book you pick up will be somewhat
outdated.

I would suggest learning it from several sources, as below:


   - the Spark official documentation; trust me, you will go through it
   several times if you want to learn it well: http://spark.apache.org/
   - spark summit, lots of videos and slide, high quality :
   https://spark-summit.org/
   - databricks' blog : https://databricks.com/blog
   - attend spark meetup : http://www.meetup.com/
   - try Spark third-party packages if needed and convenient:
   http://spark-packages.org/
   - and I have just started blogging my Spark learning memos on my blog:
   http://litaotao.github.io


In a word, I think the best way to learn it is: official *documentation +
databricks blog + others' blogs ===>>> your own blog [ a tutorial by you, or
just a memo of your learning ]*

On Mon, Feb 29, 2016 at 4:50 PM, Ashok Kumar 
wrote:

> Thank you all for valuable advice. Much appreciated
>
> Best
>
>
> On Sunday, 28 February 2016, 21:48, Ashok Kumar 
> wrote:
>
>
>   Hi Gurus,
>
> Appreciate if you recommend me a good book on Spark or documentation for
> beginner to moderate knowledge
>
> I very much like to skill myself on transformation and action methods.
>
> FYI, I have already looked at examples on net. However, some of them not
> clear at least to me.
>
> Warmest regards
>
>
>


-- 
*--*
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao


Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-29 Thread Abhishek Anand
Hi Ryan,

It's not working even after removing the reduceByKey.

So, basically I am doing the following
- reading from kafka
- flatmap inside transform
- mapWithState
- rdd.count on output of mapWithState

But to my surprise I still don't see checkpointing taking place.

Is there any restriction to the type of operation that we can perform
inside mapWithState ?

I really need to resolve this one, as currently if my application is restarted
from checkpoint it has to repartition 120 previous stages, which takes a huge
amount of time.

Thanks !!
Abhi

On Mon, Feb 29, 2016 at 3:42 AM, Shixiong(Ryan) Zhu  wrote:

> Sorry that I forgot to tell you that you should also call `rdd.count()`
> for "reduceByKey" as well. Could you try it and see if it works?
>
> On Sat, Feb 27, 2016 at 1:17 PM, Abhishek Anand 
> wrote:
>
>> Hi Ryan,
>>
>> I am using mapWithState after doing reduceByKey.
>>
>> I am right now using mapWithState as you suggested and triggering the
>> count manually.
>>
>> But, still unable to see any checkpointing taking place. In the DAG I can
>> see that the reduceByKey operation for the previous batches are also being
>> computed.
>>
>>
>> Thanks
>> Abhi
>>
>>
>> On Tue, Feb 23, 2016 at 2:36 AM, Shixiong(Ryan) Zhu <
>> shixi...@databricks.com> wrote:
>>
>>> Hey Abhi,
>>>
>>> Using reducebykeyandwindow and mapWithState will trigger the bug
>>> in SPARK-6847. Here is a workaround to trigger checkpoint manually:
>>>
>>> JavaMapWithStateDStream<...> stateDStream =
>>> myPairDstream.mapWithState(StateSpec.function(mappingFunc));
>>> stateDStream.foreachRDD(new Function1<...>() {
>>>   @Override
>>>   public Void call(JavaRDD<...> rdd) throws Exception {
>>> rdd.count();
>>>   }
>>> });
>>> return stateDStream.stateSnapshots();
>>>
>>>
>>> On Mon, Feb 22, 2016 at 12:25 PM, Abhishek Anand <
>>> abhis.anan...@gmail.com> wrote:
>>>
 Hi Ryan,

 Reposting the code.

 Basically my use case is something like - I am receiving the web
 impression logs and may get the notify (listening from kafka) for those
 impressions in the same interval (for me its 1 min) or any next interval
 (upto 2 hours). Now, when I receive notify for a particular impression I
 need to swap the date field in impression with the date field in notify
 logs. The notify for an impression has the same key as impression.

 static Function3<String, Optional<MyClass>, State<MyClass>,
 Tuple2<String, MyClass>> mappingFunc =
 new Function3<String, Optional<MyClass>, State<MyClass>, Tuple2<String,
 MyClass>>() {
 @Override
 public Tuple2<String, MyClass> call(String key, Optional<MyClass> one,
 State<MyClass> state) {
 MyClass nullObj = new MyClass();
 nullObj.setImprLog(null);
 nullObj.setNotifyLog(null);
 MyClass current = one.or(nullObj);

 if(current!= null && current.getImprLog() != null &&
 current.getMyClassType() == 1 /*this is impression*/){
 return new Tuple2<>(key, null);
 }
 else if (current.getNotifyLog() != null  && current.getMyClassType() ==
 3 /*notify for the impression received*/){
 MyClass oldState = (state.exists() ? state.get() : nullObj);
 if(oldState!= null && oldState.getNotifyLog() != null){
 oldState.getImprLog().setCrtd(current.getNotifyLog().getCrtd());
  //swappping the dates
 return new Tuple2<>(key, oldState);
 }
 else{
 return new Tuple2<>(key, null);
 }
 }
 else{
 return new Tuple2<>(key, null);
 }

 }
 };


 return
 myPairDstream.mapWithState(StateSpec.function(mappingFunc)).stateSnapshots();


 Currently I am using reducebykeyandwindow without the inverse function
 and I am able to get the correct data. But, issue the might arise is when I
 have to restart my application from checkpoint and it repartitions and
 computes the previous 120 partitions, which delays the incoming batches.


 Thanks !!
 Abhi

 On Tue, Feb 23, 2016 at 1:25 AM, Shixiong(Ryan) Zhu <
 shixi...@databricks.com> wrote:

> Hey Abhi,
>
> Could you post how you use mapWithState? By default, it should do
> checkpointing every 10 batches.
> However, there is a known issue that prevents mapWithState from
> checkpointing in some special cases:
> https://issues.apache.org/jira/browse/SPARK-6847
>
> On Mon, Feb 22, 2016 at 5:47 AM, Abhishek Anand <
> abhis.anan...@gmail.com> wrote:
>
>> Any Insights on this one ?
>>
>>
>> Thanks !!!
>> Abhi
>>
>> On Mon, Feb 15, 2016 at 11:08 PM, Abhishek Anand <
>> abhis.anan...@gmail.com> wrote:
>>
>>> I am now trying to use mapWithState in the following way, using some
>>> example code. But, looking at the DAG, it does not seem to checkpoint
>>> the state, and when restarting the application from the checkpoint it
>>> re-partitions all the previous batches' data from Kafka.
>>>
>>> static Function3<String, Optional<MyClass>, State<MyClass>,
>>> Tuple2<String, MyClass>> mappingFunc =
>>> new Function3<String, Optional<MyClass>, State<MyClass>,
>>> Tuple2<String, MyClass>>

Re: java.io.IOException: java.lang.reflect.InvocationTargetException on new spark machines

2016-02-29 Thread Abhishek Anand
Hi Ryan,

I was able to resolve this issue. The /tmp location was mounted with the
"noexec" option, so the snappy shared object file, which is created at the
/tmp location, could not be loaded. Either removing noexec from the mount in
fstab or changing the default temp location (as below) resolves the issue:

export _JAVA_OPTIONS=-Djava.io.tmpdir=/mycustometemplocation
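If editing fstab or exporting _JAVA_OPTIONS on every node is not convenient, the same override can also be pushed through the Spark configuration. A sketch, assuming the custom directory exists on all nodes (the path is the same placeholder as above, and the keys are the standard extraJavaOptions settings):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("my-streaming-job")
    // Point the JVM temp dir (where the snappy .so gets extracted) away from a noexec /tmp.
    .set("spark.executor.extraJavaOptions", "-Djava.io.tmpdir=/mycustometemplocation")
    .set("spark.driver.extraJavaOptions", "-Djava.io.tmpdir=/mycustometemplocation")
    // Alternatively, the interim workaround mentioned below in the thread:
    // .set("spark.io.compression.codec", "lzf")

Note that spark.driver.extraJavaOptions only takes effect if it is set before the driver JVM starts (spark-defaults.conf or the spark-submit command line), so in client mode the exported _JAVA_OPTIONS above is still the simpler route for the driver.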



Thanks !!
Abhi


On Mon, Feb 29, 2016 at 3:46 AM, Shixiong(Ryan) Zhu  wrote:

> This is because the Snappy library cannot load the native library. Did you
> forget to install the snappy native library on your new machines?
>
> On Fri, Feb 26, 2016 at 11:05 PM, Abhishek Anand 
> wrote:
>
>> Any insights on this ?
>>
>> On Fri, Feb 26, 2016 at 1:21 PM, Abhishek Anand 
>> wrote:
>>
>>> On changing the default compression codec from snappy to lzf, the
>>> errors are gone !!
>>>
>>> How can I fix this while keeping snappy as the codec ?
>>>
>>> Is there any downside to using lzf, given that snappy is the default codec
>>> that ships with Spark ?
>>>
>>>
>>> Thanks !!!
>>> Abhi
>>>
 On Mon, Feb 22, 2016 at 7:42 PM, Abhishek Anand  wrote:
>>>
 Hi,

 I am getting the following exception when running my spark streaming job.

 The same job has been running fine for a long time, but when I added two new
 machines to my cluster I started seeing the job fail with the following exception.



 16/02/22 19:23:01 ERROR Executor: Exception in task 2.0 in stage 4229.0
 (TID 22594)
 java.io.IOException: java.lang.reflect.InvocationTargetException
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1257)
 at
 org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
 at
 org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
 at
 org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
 at
 org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
 at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:59)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:744)
 Caused by: java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
 at
 org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:68)
 at
 org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:60)
 at org.apache.spark.broadcast.TorrentBroadcast.org
 $apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
 at
 org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:167)
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1254)
 ... 11 more
 Caused by: java.lang.IllegalArgumentException
 at
 org.apache.spark.io.SnappyCompressionCodec.<init>(CompressionCodec.scala:152)
 ... 20 more



 Thanks !!!
 Abhi

>>>
>>>
>>
>


Re: Spark Integration Patterns

2016-02-29 Thread Alex Dzhagriev
Hi Moshir,

I think you can use the REST API provided with Spark:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala

Unfortunately, I haven't found any documentation, but it looks fine.
Thanks, Alex.
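As a rough starting point, the standalone master's REST submission endpoint (default port 6066) accepts a CreateSubmissionRequest; the sketch below posts one from plain Scala. The endpoint path and JSON field names are taken from the RestSubmissionServer sources linked above as I read them, so treat them as assumptions and verify against your Spark version; host names, jar path and main class are placeholders.

  import java.net.{HttpURLConnection, URL}
  import java.nio.charset.StandardCharsets
  import scala.io.Source

  object RestSubmitSketch {
    def main(args: Array[String]): Unit = {
      // Standalone master REST endpoint (default port 6066).
      val url = new URL("http://spark-master:6066/v1/submissions/create")

      val payload =
        """{
          |  "action": "CreateSubmissionRequest",
          |  "clientSparkVersion": "1.6.0",
          |  "appResource": "hdfs:///apps/my-app.jar",
          |  "mainClass": "com.example.Main",
          |  "appArgs": ["some-arg"],
          |  "environmentVariables": { "SPARK_ENV_LOADED": "1" },
          |  "sparkProperties": {
          |    "spark.master": "spark://spark-master:6066",
          |    "spark.app.name": "rest-submit-sketch",
          |    "spark.jars": "hdfs:///apps/my-app.jar",
          |    "spark.submit.deployMode": "cluster",
          |    "spark.driver.supervise": "false"
          |  }
          |}""".stripMargin

      val conn = url.openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestMethod("POST")
      conn.setRequestProperty("Content-Type", "application/json;charset=UTF-8")
      conn.setDoOutput(true)
      val out = conn.getOutputStream
      out.write(payload.getBytes(StandardCharsets.UTF_8))
      out.close()

      // The JSON response contains a submissionId that can be polled at
      // /v1/submissions/status/<submissionId>.
      println(Source.fromInputStream(conn.getInputStream).mkString)
    }
  }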

On Sun, Feb 28, 2016 at 3:25 PM, mms  wrote:

> Hi, I cannot find a simple example showing how a typical application can
> 'connect' to a remote spark cluster and interact with it. Let's say I have
> a Python web application hosted somewhere *outside* a spark cluster, with
> just python installed on it. How can I talk to Spark without using a
> notebook, or using ssh to connect to a cluster master node ? I know of
> spark-submit and spark-shell, however forking a process on a remote host to
> execute a shell script seems like a lot of effort. What are the recommended
> ways to connect to and query Spark from a remote client ? Thanks !
>


Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-29 Thread Ashok Kumar
Thank you all for the valuable advice. Much appreciated.
Best

On Sunday, 28 February 2016, 21:48, Ashok Kumar  
wrote:
 

   Hi Gurus,
I would appreciate it if you could recommend a good book on Spark, or documentation, for
beginner to moderate knowledge.
I would very much like to skill myself on the transformation and action methods.
FYI, I have already looked at examples on the net. However, some of them are not clear,
at least to me.
Warmest regards

  

Re: Spark Integration Patterns

2016-02-29 Thread moshir mikael
Well,

I have a personal project where I want to build a *spreadsheet* on top of
Spark.
I have a version of my app running on PostgreSQL, which does not scale, and
I would like to move the data processing to Spark.
You can import data, explore data, analyze data, visualize data ...
You don't need to be an advanced technical user to use it.
I believe it would be much easier to use Spark than PostgreSQL for this
kind of dynamic data exploration.
For instance, a formula can be achieved with a simple #.map statement (a toy
sketch follows below).

However, I need some kind of connection to Spark to work with RDDs.
This is what makes me wonder how you can talk to Spark from a web app.
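As a toy illustration of the "formula as a map" idea (the column names, values and the sc handle are all made up):

  // Each row is a map of column name -> value; a spreadsheet formula such as
  // profit = revenue - cost is then just a map over the RDD.
  val rows = sc.parallelize(Seq(
    Map("revenue" -> 120.0, "cost" -> 80.0),
    Map("revenue" -> 200.0, "cost" -> 150.0)))

  val withProfit = rows.map(r => r + ("profit" -> (r("revenue") - r("cost"))))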










On Sun, Feb 28, 2016 at 11:36 PM, ayan guha  wrote:

> I believe you are looking for something like Spark Jobserver for running
> jobs & the JDBC server for accessing data? I am curious to know more about it;
> any further discussion will be very helpful.
>
> On Mon, Feb 29, 2016 at 6:06 AM, Luciano Resende 
> wrote:
>
>> One option we have used in the past is to expose Spark application
>> functionality via REST; this would enable Python or any other client that
>> is capable of making an HTTP request to integrate with your Spark application.
>>
>> To get you started, this might be a useful reference
>>
>>
>> http://blog.michaelhamrah.com/2013/06/scala-web-apis-up-and-running-with-spray-and-akka/
>>
>>
>> On Sun, Feb 28, 2016 at 10:38 AM, moshir mikael 
>> wrote:
>>
>>> Ok,
>>> but what do I need for the program to run?
>>> In Python, sparkcontext = SparkContext(conf) only works when you have
>>> Spark installed locally.
>>> AFAIK there is no *pyspark* package for Python that you can install by
>>> doing pip install pyspark.
>>> You actually need to install Spark to get it running (e.g.:
>>> https://github.com/KristianHolsheimer/pyspark-setup-guide).
>>>
>>> Does it mean you need to install Spark on the box your application runs on
>>> to benefit from pyspark, and that this is required to connect to another remote
>>> Spark cluster ?
>>> Am I missing something obvious ?
>>>
>>>
>>> On Sun, Feb 28, 2016 at 7:01 PM, Todd Nist  wrote:
>>>
 Define your SparkConf to set the master:

   val conf = new SparkConf().setAppName(AppName)
 .setMaster(SparkMaster)
 .set()

 Where SparkMaster = "spark://SparkServerHost:7077".  So if your spark
 server hostname is "RADTech" then it would be "spark://RADTech:7077".

 Then when you create the SparkContext, pass the SparkConf  to it:

 val sparkContext = new SparkContext(conf)

 Then use the sparkContext to interact with the SparkMaster / cluster.
 Your program basically becomes the driver.

 HTH.

 -Todd

 On Sun, Feb 28, 2016 at 9:25 AM, mms  wrote:

> Hi, I cannot find a simple example showing how a typical application
> can 'connect' to a remote spark cluster and interact with it. Let's say I
> have a Python web application hosted somewhere *outside* a spark
> cluster, with just python installed on it. How can I talk to Spark without
> using a notebook, or using ssh to connect to a cluster master node ? I know
> of spark-submit and spark-shell, however forking a process on a remote host
> to execute a shell script seems like a lot of effort. What are the
> recommended ways to connect to and query Spark from a remote client ? Thanks !
>


>>
>>
>> --
>> Luciano Resende
>> http://people.apache.org/~lresende
>> http://twitter.com/lresende1975
>> http://lresende.blogspot.com/
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: DirectFileOutputCommiter

2016-02-29 Thread Takeshi Yamamuro
Hi,

I think the essential culprit is that these committers are not idempotent;
retry attempts will fail.
See codes below for details;
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L130
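For context, the direct committers being discussed are essentially no-op committers along the lines of the sketch below (modeled loosely on the gist linked later in the thread; the old mapred API is an assumption). Because commitTask and abortTask do nothing, task output goes straight to the final location, so a failed, retried or speculative attempt has nothing to roll back, which is exactly the idempotency problem above.

  import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}

  // No-op committer: tasks write directly to the final output path.
  class DirectOutputCommitter extends OutputCommitter {
    override def setupJob(jobContext: JobContext): Unit = {}
    override def setupTask(taskContext: TaskAttemptContext): Unit = {}
    // Nothing is staged under _temporary, so no per-task commit is needed.
    override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
    override def commitTask(taskContext: TaskAttemptContext): Unit = {}
    // A failed or speculative attempt leaves whatever it already wrote in place.
    override def abortTask(taskContext: TaskAttemptContext): Unit = {}
  }

Such a committer is typically wired in through spark-defaults.conf (for the RDD API something like spark.hadoop.mapred.output.committer.class, which is an assumption here; the spark-defaults link quoted further down shows the settings the Zalando fork uses).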

On Sat, Feb 27, 2016 at 7:38 PM, Igor Berman  wrote:

> Hi Reynold,
> thanks for the response
> Yes, speculation mode needs some coordination.
> Regarding job failure:
> correct me if I am wrong - if one of the jobs fails, the client code will be sort of
> "notified" by an exception or something similar, so the client can decide to
> re-submit the action (job), i.e. it won't be a "silent" failure.
>
>
> On 26 February 2016 at 11:50, Reynold Xin  wrote:
>
>> It could lose data in speculation mode, or if any job fails.
>>
>> On Fri, Feb 26, 2016 at 3:45 AM, Igor Berman 
>> wrote:
>>
>>> Takeshi, do you know the reason why they wanted to remove this committer
>>> in SPARK-10063?
>>> The JIRA has no info inside.
>>> As far as I understand, the direct committer can't be used when either of
>>> the two is true:
>>> 1. speculation mode
>>> 2. append mode (i.e. not creating a new version of the data but appending to
>>> existing data)
>>>
>>> On 26 February 2016 at 08:24, Takeshi Yamamuro 
>>> wrote:
>>>
 Hi,

 Great work!
 What is the concrete performance gain of the committer on s3?
 I'd like to know.

 I think there is no direct committer for files because these kinds of
 committers have a risk
 of losing data (see SPARK-10063).
 Until this is resolved, ISTM files cannot support direct commits.

 thanks,



 On Fri, Feb 26, 2016 at 8:39 AM, Teng Qiu  wrote:

> yes, should be this one
> https://gist.github.com/aarondav/c513916e72101bbe14ec
>
> then you need to set it in spark-defaults.conf:
> https://github.com/zalando/spark/commit/3473f3f1ef27830813c1e0b3686e96a55f49269c#diff-f7a46208be9e80252614369be6617d65R13
>
> On Fri, Feb 26, 2016, Yin Yang  wrote:
> > The header of DirectOutputCommitter.scala says Databricks.
> > Did you get it from Databricks ?
> > On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu  wrote:
> >>
> >> interested in this topic as well; why is
> the DirectFileOutputCommitter not included?
> >> we added it in our fork,
> under
> core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala
> >> moreover, this DirectFileOutputCommitter is not working for the
> insert operations in HiveContext, since the Committer is called by hive
> (meaning it uses dependencies from the hive package)
> >> we made some hack to fix this, you can take a look:
> >>
> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
> >>
> >> it may give other spark contributors some ideas for finding a better
> way to use s3.
> >>
> >> 2016-02-22 23:18 GMT+01:00 igor.berman :
> >>>
> >>> Hi,
> >>> Wanted to understand if anybody uses DirectFileOutputCommitter or
> the like,
> >>> especially when working with s3?
> >>> I know that there is one impl in the spark distro for the parquet format,
> but not
> >>> for files - why?
> >>>
> >>> Imho, it can bring a huge performance boost.
> >>> Using the default FileOutputCommitter with s3 has a big overhead at the
> commit stage,
> >>> when all parts are copied one-by-one to the destination dir from
> _temporary,
> >>> which is a bottleneck when the number of partitions is high.
> >>>
> >>> Also, wanted to know if there are some problems when using
> >>> DirectFileOutputCommitter?
> >>> If writing one partition directly fails in the middle, will spark
> notice this
> >>> and fail the job (say, after all retries)?
> >>>
> >>> thanks in advance
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/DirectFileOutputCommiter-tp26296.html
> >>> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
> >>>
> >>>
> >>>
> >>
> >
> >
>



 --
 ---
 Takeshi Yamamuro

>>>
>>>
>>
>


-- 
---
Takeshi Yamamuro