Re: Reading lzo+index with spark-csv (Splittable reads)

2016-01-29 Thread syepes
Well, looking at the source it looks like it's not implemented:

https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/util/TextFile.scala#L34-L36
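As a workaround sketch (untested, and assuming the hadoop-lzo library with its
com.hadoop.mapreduce.LzoTextInputFormat is on the classpath and the .index files
are in place), you can do the splittable read yourself with newAPIHadoopFile and
parse the lines without going through spark-csv:

--
import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

// Splittable read: the splits are derived from the LZO .index file
val lines = sc.newAPIHadoopFile(
    "/user/sy/data.csv.lzo",
    classOf[LzoTextInputFormat],
    classOf[LongWritable],
    classOf[Text])
  .map(_._2.toString)

// Drop the header and split the fields; from here you can build Rows and a
// DataFrame with your own schema instead of relying on spark-csv
val header = lines.first()
val rows = lines.filter(_ != header).map(_.split(","))
rows.count()
--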








Reading lzo+index with spark-csv (Splittable reads)

2016-01-29 Thread syepes
Hello,

I have managed to speed up the read stage when loading CSV files using the
classic "newAPIHadoopFile" method. The issue is that I would like to use the
spark-csv package, and it seems that it is not taking the LZO index file /
splittable reads into consideration.

# Using the classic method the read is fully parallelized (splittable)
sc.newAPIHadoopFile("/user/sy/data.csv.lzo", ...).count

# When spark-csv is used the file is read from only one node (no splittable
# reads)
sqlContext.read.format("com.databricks.spark.csv").options(Map("path" ->
"/user/sy/data.csv.lzo", "header" -> "true", "inferSchema" ->
"false")).load().count()

Does anyone know if this is currently supported?








Spark 1.6 - YARN Cluster Mode

2015-12-17 Thread syepes
Hello,

This week I have been testing 1.6 (#d509194b) on our HDP 2.3 platform and
it's been working pretty well, with the exception of the YARN cluster
deployment mode.
Note that with 1.5, using the same "spark-props.conf" and "spark-env.sh"
config files, the cluster mode works as expected.

Has anyone else also tried the cluster mode in 1.6?


Problem reproduction:

# spark-submit --master yarn --deploy-mode cluster --num-executors 1 \
    --properties-file $PWD/spark-props.conf \
    --class org.apache.spark.examples.SparkPi \
    /opt/spark/lib/spark-examples-1.6.0-SNAPSHOT-hadoop2.7.1.jar

Error: Could not find or load main class
org.apache.spark.deploy.yarn.ApplicationMaster

spark-props.conf
-
spark.driver.extraJavaOptions    -Dhdp.version=2.3.2.0-2950
spark.driver.extraLibraryPath    /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.executor.extraJavaOptions  -Dhdp.version=2.3.2.0-2950
spark.executor.extraLibraryPath  /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
-

I will try to do some more debugging on this issue.
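One thing I still want to rule out (this is only an assumption on my side, not a
confirmed fix) is whether the 1.6 assembly is actually the one being localized
for the ApplicationMaster container. Pinning it explicitly in spark-props.conf
would look something like this (the HDFS path below is a placeholder):

-
spark.yarn.jar  hdfs:///apps/spark/spark-assembly-1.6.0-SNAPSHOT-hadoop2.7.1.jar
-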







Re: [Yarn-Client]Can not access SparkUI

2015-10-26 Thread syepes
Hello Earthson,

Is your cluster multihomed?

If yes, try setting the variables SPARK_LOCAL_{IP,HOSTNAME}; I had this issue
before: https://issues.apache.org/jira/browse/SPARK-11147
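For reference, a minimal sketch of what that looks like in conf/spark-env.sh
(the address and hostname below are placeholders for the interface that is
reachable by your clients):

-
export SPARK_LOCAL_IP=10.0.0.5
export SPARK_LOCAL_HOSTNAME=node1.example.com
-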






Re: Kafka createDirectStream issue

2015-06-24 Thread syepes
Hello,

Thanks for all the help on resolving this issue, especially to Cody who
guided me to the solution.

For others facing similar issues: basically the problem was that I was running
Spark Streaming jobs from the spark-shell, and this is not supported. Running
the same job through spark-submit works as expected.

Does anyone know if there is some way to get around this problem?
The build-jar/submit process is a bit cumbersome when trying to debug and
test new jobs.
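For reference, the kind of spark-submit invocation that does work for me looks
roughly like this (the class name and jar paths below are only placeholders):

-
spark-submit \
  --master spark://localhost:7077 \
  --class com.example.streaming.KafkaWordCount \
  --jars /opt/spark-libs/spark-streaming-kafka-assembly_2.10-1.4.2-SNAPSHOT.jar \
  target/scala-2.10/kafka-wordcount_2.10-0.1.jar
-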


Best regards,
Sebastian






Kafka createDirectStream issue

2015-06-23 Thread syepes
Hello,

I am trying to use the new Kafka consumer KafkaUtils.createDirectStream,
but I am having some issues making it work.
I have tried different versions of Spark, v1.4.0 and branch-1.4 #8d6e363, and
I am still getting the same strange exception ClassNotFoundException:
$line49.$read$$iwC$$i

Has anyone else been facing this kind of problem?

The following is the code and logs that I have been using to reproduce the
issue:

spark-shell: script
--
sc.stop()
import _root_.kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf()
  .setMaster("spark://localhost:7077")
  .setAppName("KCon")
  .set("spark.ui.port", "4041")
  .set("spark.driver.allowMultipleContexts", "true")
  .setJars(Array("/opt/spark-libs/spark-streaming-kafka-assembly_2.10-1.4.2-SNAPSHOT.jar"))
val ssc = new StreamingContext(sparkConf, Seconds(5))

val kafkaParams = Map[String, String](
  "bootstrap.servers" -> "localhost:9092",
  "schema.registry.url" -> "http://localhost:8081",
  "zookeeper.connect" -> "localhost:2181",
  "group.id" -> "KCon")
val topic = Set("test")
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder,
  StringDecoder](ssc, kafkaParams, topic)

val raw = messages.map(_._2)
val words = raw.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()

ssc.start()
ssc.awaitTermination()
--


spark-shell: output
--
sparkConf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@330e37b2
ssc: org.apache.spark.streaming.StreamingContext =
org.apache.spark.streaming.StreamingContext@28ec9c23
kafkaParams: scala.collection.immutable.Map[String,String] =
Map(bootstrap.servers -> localhost:9092, schema.registry.url ->
http://localhost:8081, zookeeper.connect -> localhost:2181, group.id -> OPC)
topic: scala.collection.immutable.Set[String] = Set(test)
WARN  [main] kafka.utils.VerifiableProperties - Property schema.registry.url
is not valid
messages: org.apache.spark.streaming.dstream.InputDStream[(String, String)]
= org.apache.spark.streaming.kafka.DirectKafkaInputDStream@1e71b70d
raw: org.apache.spark.streaming.dstream.DStream[String] =
org.apache.spark.streaming.dstream.MappedDStream@578ce232
words: org.apache.spark.streaming.dstream.DStream[String] =
org.apache.spark.streaming.dstream.FlatMappedDStream@351cc4b5
wordCounts: org.apache.spark.streaming.dstream.DStream[(String, Long)] =
org.apache.spark.streaming.dstream.ShuffledDStream@ae04104
WARN  [JobGenerator] kafka.utils.VerifiableProperties - Property
schema.registry.url is not valid
WARN  [task-result-getter-0] org.apache.spark.scheduler.TaskSetManager -
Lost task 0.0 in stage 0.0 (TID 0, 10.3.30.87):
java.lang.ClassNotFoundException:
$line49.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:66)
at
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
..
..
Driver stacktrace:
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
--


Best regards and thanks in advance for any help.
  




Re: Kafka createDirectStream issue

2015-06-23 Thread syepes
Yes, I have two clusters, one standalone and another using Mesos.

 Sebastian YEPES
   http://sebastian-yepes.com

On Wed, Jun 24, 2015 at 12:37 AM, drarse [via Apache Spark User List] wrote:

 Hi syepes,
 Are you running the application in standalone mode?
 Regards

Re: spark job progress-style report on console?

2015-04-15 Thread syepes
Just add the following line to your conf/spark-defaults.conf file:

spark.ui.showConsoleProgress  true
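If you would rather not touch spark-defaults.conf, the same property can also
be passed per job on the command line (the class and jar names below are only
placeholders):

-
spark-submit --conf spark.ui.showConsoleProgress=true --class com.example.MyJob myjob.jar
-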






EventLog / Timeline calculation - Optimization

2015-02-24 Thread syepes
Hello,

For the past few days I have been trying to process and analyse with Spark a
Cassandra eventLog table similar to the one shown here.
Basically what I want to calculate is the time delta (in epoch milliseconds)
between each event for all the device id's in the table. Currently it's working
as expected, but I am wondering if there is a better or more optimal way of
achieving this kind of calculation in Spark.

Note that to simplify the example I have removed all the Cassandra stuff and
just use a CSV file.

*eventLog.txt:*
dev_id,event_type,event_ts
-
1,loging,2015-01-03 01:15:00
1,activated,2015-01-03 01:10:00
1,register,2015-01-03 01:00:00
2,get_data,2015-01-02 01:00:10
2,loging,2015-01-02 01:00:00
3,update_data,2015-01-01 01:15:00
3,get_data,2015-01-01 01:10:00
3,loging,2015-01-01 01:00:00
-

*Spark Code:*
-
import java.sql.Timestamp
def getDateDiff(d1: String, d2: String): Long = {
  Timestamp.valueOf(d2).getTime() - Timestamp.valueOf(d1).getTime() }

val rawEvents = sc.textFile("eventLog.txt").map(_.split(",")).map(e =>
  (e(0).trim.toInt, e(1).trim, e(2).trim))
val indexed = rawEvents.zipWithIndex.map(_.swap)
val shifted = indexed.map{ case (k, v) => (k - 1, v) }
val joined = indexed.join(shifted)
val cleaned = joined.filter(x => x._2._1._1 == x._2._2._1) // Filter out pairs whose dev_id's don't match
val eventDuration = cleaned.map{ case (i, (v1, v2)) =>
  (v1._1, s"${v1._2} -> ${v2._2}", getDateDiff(v2._3, v1._3)) }
eventDuration.collect.foreach(println)
-

*Output:*
-
(1,loging -> activated,300000)
(3,get_data -> loging,600000)
(1,activated -> register,600000)
(2,get_data -> loging,10000)
(3,update_data -> get_data,300000)


This code was inspired by the following posts:
http://stackoverflow.com/questions/26560292/apache-spark-distance-between-two-points-using-squareddistance
http://apache-spark-user-list.1001560.n3.nabble.com/Cumulative-distance-calculation-on-moving-objects-RDD-td20729.html
http://stackoverflow.com/questions/28236347/functional-approach-in-sequential-rdd-processing-apache-spark
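One rough alternative I have been considering (an untested sketch, assuming the
events of a single device fit comfortably in memory) is to group by dev_id,
sort each group by timestamp and take the deltas with a sliding window, which
trades the zipWithIndex + join for a single groupBy:

-
import java.sql.Timestamp
def getDateDiff(d1: String, d2: String): Long =
  Timestamp.valueOf(d2).getTime() - Timestamp.valueOf(d1).getTime()

val rawEvents = sc.textFile("eventLog.txt").map(_.split(",")).map(e =>
  (e(0).trim.toInt, e(1).trim, e(2).trim))

val eventDuration = rawEvents
  .groupBy(_._1)                     // one group of events per dev_id
  .flatMap { case (devId, events) =>
    events.toSeq
      .sortBy(_._3)                  // "yyyy-MM-dd HH:mm:ss" sorts lexicographically
      .sliding(2)                    // consecutive event pairs, oldest first
      .collect { case Seq(prev, next) =>
        (devId, s"${next._2} -> ${prev._2}", getDateDiff(prev._3, next._3))
      }
  }
eventDuration.collect.foreach(println)
-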


Best regards and thanks in advance for any suggestions,
Sebastian 


