Re: installation of spark

2019-06-05 Thread Alonso Isidoro Roman
When using macOS, it is recommended to install Java, Scala and Spark using
Homebrew.

Run these commands on a terminal:

brew update

brew install scala

brew install sbt

brew cask install java

brew install spark


There is no need to install HDFS; you can use your local file system
without a problem.


*How to set JAVA_HOME on Mac OS X temporarily*

   1. Open *Terminal*.
   2. Confirm you have a JDK by typing “which java”.
   3. Check you have the needed version of Java by typing “java -version”.
   4. Set *JAVA_HOME* using this command in *Terminal*: export JAVA_HOME=/Library/Java/Home
   5. Run echo $JAVA_HOME in *Terminal* to confirm the path.
   6. You should now be able to run your application.


*How to set JAVA_HOME on Mac OS X permanently*

$ vim ~/.bash_profile

Add this line to .bash_profile:

export JAVA_HOME=$(/usr/libexec/java_home)

$ source .bash_profile

$ echo $JAVA_HOME
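
Once JAVA_HOME is set, a quick sanity check is to start the brew-installed
spark-shell and run the following (a minimal sketch; the exact output
depends on your Spark version):

// inside spark-shell: spark (SparkSession) and sc (SparkContext) are predefined
val ds = spark.range(1, 1000)
println(ds.count())   // should print 999
println(sc.version)   // prints the installed Spark version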


Have fun!

Alonso


On Wed, Jun 5, 2019 at 6:10, Jack Kolokasis () wrote:

> Hello,
>
> at first you will need to make sure that JAVA is installed, or install
> it otherwise. Then install scala and a build tool (sbt or maven). In my
> point of view, IntelliJ IDEA is a good option to create your Spark
> applications.  At the end you have to install a distributed file system e.g
> HDFS.
>
> I think there is no all-in-one configuration. But there are
> examples about how to configure your Spark cluster (e.g
> https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-standalone-example-2-workers-on-1-node-cluster.adoc
> ).
> Best,
> --Iacovos
> On 5/6/19 5:50 π.μ., ya wrote:
>
> Dear list,
>
> I am very new to spark, and I am having trouble installing it on my mac. I
> have following questions, please give me some guidance. Thank you very much.
>
> 1. How many and what software should I install before installing spark? I
> have been searching online, people discussing their experiences on this
> topic with different opinions, some says there is no need to install hadoop
> before install spark, some says hadoop has to be installed before spark.
> Some other people say scala has to be installed, whereas others say scala
> is included in spark, and it is installed automatically once spark in
> installed. So I am confused what to install for a start.
>
> 2.  Is there an simple way to configure these software? for instance, an
> all-in-one configuration file? It takes forever for me to configure things
> before I can really use it for data analysis.
>
> I hope my questions make sense. Thank you very much.
>
> Best regards,
>
> YA
>
>

-- 
Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>


Re: write files of a specific size

2019-05-05 Thread Alonso Isidoro Roman
Check these links:

https://stackoverflow.com/questions/31610971/spark-repartition-vs-coalesce

https://medium.com/@mrpowers/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4
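
In short, the idea is (a minimal sketch; the DataFrame name, output path and
size estimate below are placeholders, not your actual job):

// target roughly 100 MB per output file
val targetFileBytes = 100L * 1024 * 1024

// assumption: you can estimate the total output size, e.g. ~10 GB
val estimatedOutputBytes = 10L * 1024 * 1024 * 1024
val numFiles = math.max(1, (estimatedOutputBytes / targetFileBytes).toInt)

// coalesce(n) avoids a shuffle but can only reduce the partition count;
// repartition(n) shuffles and can either increase or reduce it.
resultDF
  .repartition(numFiles)
  .write
  .parquet("/path/to/output")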



On Sun, May 5, 2019 at 11:48, hemant singh () wrote:

> Based on size of the output data you can do the math of how many file you
> will need to produce 100MB files. Once you have number of files you can do
> coalesce or repartition depending on whether your job writes more or less
> output partitions.
>
> On Sun, 5 May 2019 at 2:21 PM, rajat kumar 
> wrote:
>
>> Hi All,
>> My spark sql job produces output as per default partition and creates N
>> number of files.
>> I want to create each file as 100Mb sized in the final result.
>>
>> how can I do it ?
>>
>> thanks
>> rajat
>>
>>

-- 
Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>


Re: [Spark2.1] SparkStreaming to Cassandra performance problem

2018-05-21 Thread Alonso Isidoro Roman
are working with dataframes and the kafkaStream instance does not
>> provide methods to convert an RDD into DF.
>>
>> Regarding my table, it is very simple (see below). Can I change something
>> to make it write faster?
>> CREATE TABLE test_hdpkns.measurement (
>>   mid bigint,
>>   tt timestamp,
>>   in_tt timestamp,
>>   out_tt timestamp,
>>   sensor_id int,
>>   measure double,
>>   PRIMARY KEY (mid, tt, sensor_id, in_tt, out_tt)
>> ) with compact storage;
>>
>> The system CPU while the demo is running is almost always at 100% for
>> both cores.
>>
>>
>> Thank you.
>>
>> Best Regards,
>>
>> On 29/04/2018 20:46:30, Javier Pareja <pareja.jav...@gmail.com> wrote:
>> Hi Saulo,
>>
>> I meant using this to save:
>>
>> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/8_streaming.md#writing-to-cassandra-from-a-stream
>>
>> But it might be slow on a different area.
>> Another point is that Cassandra and spark running on the same machine
>> might compete for resources which will slow down the insert. You can check
>> the CPU usage of the machine at the time. Also the design of the table
>> schema can make a big difference.
>>
>>
>> On Sun, 29 Apr 2018, 19:02 Saulo Sobreiro, <saulo.sobre...@outlook.pt>
>> wrote:
>>
>>> Hi Javier,
>>>
>>>
>>> I removed the map and used "map" directly instead of using transform,
>>> but the *kafkaStream* is created with KafkaUtils which does not have a
>>> method to save to cassandra directly.
>>>
>>> Do you know any workarround for this?
>>>
>>>
>>> Thank you for the suggestion.
>>>
>>> Best Regards,
>>>
>>> On 29/04/2018 17:03:24, Javier Pareja <pareja.jav...@gmail.com> wrote:
>>> Hi Saulo,
>>>
>>> I'm no expert but I will give it a try.
>>> I would remove the rdd2.count(), I can't see the point and you will gain
>>> performance right away. Because of this, I would not use a transform, just
>>> directly the map.
>>> I have not used python but in Scala the cassandra-spark connector can
>>> save directly to Cassandra without a foreachRDD.
>>>
>>> Finally I would use the spark UI to find which stage is the bottleneck
>>> here.
>>>
>>> On Sun, 29 Apr 2018, 01:17 Saulo Sobreiro, <saulo.sobre...@outlook.pt>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am implementing a use case where I read some sensor data from Kafka
>>>> with SparkStreaming interface (*KafkaUtils.createDirectStream*) and,
>>>> after some transformations, write the output (RDD) to Cassandra.
>>>>
>>>> Everything is working properly but I am having some trouble with the
>>>> performance. My kafka topic receives around 2000 messages per second. For a
>>>> 4 min. test, the SparkStreaming app takes 6~7 min. to process and write to
>>>> Cassandra, which is not acceptable for longer runs.
>>>>
>>>> I am running this application in a "sandbox" with 12GB of RAM, 2 cores
>>>> and 30GB SSD space.
>>>> Versions: Spark 2.1, Cassandra 3.0.9 (cqlsh 5.0.1).
>>>>
>>>> I would like to know you have some suggestion to improve performance
>>>> (other than getting more resources :) ).
>>>>
>>>> My code (pyspark) is posted in the end of this email so you can take a
>>>> look. I tried some different cassandra configurations following this link:
>>>> http://www.snappydata.io/blog/snappydata-memsql-cassandra-a-performance-benchmark
>>>> (recommended in stackoverflow for similar questions).
>>>>
>>>>
>>>> Thank you in advance,
>>>>
>>>> Best Regards,
>>>> Saulo
>>>>
>>>>
>>>>
>>>> === # CODE # =
>>>> 
>>>> # run command:
>>>> # spark2-submit --packages
>>>> org.apache.spark:spark-streaming-kafka_2.11:1.6.3,anguenot:pyspark-cassandra:0.7.0,org.apache.spark:spark-core_2.11:1.5.2
>>>>  --conf spark.cassandra.connection.host='localhost' --num-executors 2
>>>> --executor-cores 2 SensorDataStreamHandler.py localhost:6667 test_topic2
>>>> ##
>>>>
>>>> # Run Spark imports
>>>> from pyspark import SparkConf # SparkContext, SparkConf
>>>> from pyspark.streaming import StreamingContext
>>>> from pyspark.streaming.kafka import KafkaUtils
>>>>
>>>> # Run Cassandra imports
>>>> import pyspark_cassandra
>>>> from pyspark_cassandra import CassandraSparkContext, saveToCassandra
>>>>
>>>> def recordHandler(record):
>>>> (mid, tt, in_tt, sid, mv) = parseData( record )
>>>> return processMetrics(mid, tt, in_tt, sid, mv)
>>>>
>>>> def process(time, rdd):
>>>> rdd2 = rdd.map( lambda w: recordHandler(w[1]) )
>>>> if rdd2.count() > 0:
>>>> return rdd2
>>>>
>>>> def casssave(time, rdd):
>>>> rdd.saveToCassandra( "test_hdpkns", "measurement" )
>>>>
>>>> # ...
>>>> brokers, topic = sys.argv[1:]
>>>>
>>>> # ...
>>>>
>>>> sconf = SparkConf() \
>>>> .setAppName("SensorDataStreamHandler") \
>>>> .setMaster("local[*]") \
>>>> .set("spark.default.parallelism", "2")
>>>>
>>>> sc = CassandraSparkContext(conf = sconf)
>>>> batchIntervalSeconds = 2
>>>> ssc = StreamingContext(sc, batchIntervalSeconds)
>>>>
>>>> kafkaStream = KafkaUtils.createDirectStream(ssc, [topic],
>>>> {"metadata.broker.list": brokers})
>>>>
>>>> kafkaStream \
>>>> .transform(process) \
>>>> .foreachRDD(casssave)
>>>>
>>>> ssc.start()
>>>> ssc.awaitTermination()
>>>>
>>>> 
>>>>
>>>>
>>>>
>>>>
>>>>

-- 
Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>


Re: is it possible to create one KafkaDirectStream (Dstream) per topic?

2018-05-21 Thread Alonso Isidoro Roman
Check this thread
<https://stackoverflow.com/questions/37810709/kafka-topic-partitions-to-spark-streaming>.
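
If it helps, a minimal sketch of the idea (one StreamingContext, one direct
stream per topic, all in the same JVM; the topic names and kafkaParams are
placeholders, and the 0.8-style direct API is assumed, as elsewhere on this
list):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka._

val topics = Seq("topic1", "topic2", "topic3", "topic4", "topic5")

// one direct stream per topic, all created from the same ssc / SparkContext
val streams = topics.map { topic =>
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set(topic))
}

// each DStream can then be processed independently
streams.zip(topics).foreach { case (stream, topic) =>
  stream.map(_._2).foreachRDD { rdd =>
    println(s"$topic: ${rdd.count()} records in this batch")
  }
}

ssc.start()
ssc.awaitTermination()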

On Mon, May 21, 2018 at 0:25, kant kodali (<kanth...@gmail.com>) wrote:

> Hi All,
>
> I have 5 Kafka topics and I am wondering if is even possible to create one
> KafkaDirectStream (Dstream) per topic within the same JVM i.e using only
> one sparkcontext?
>
> Thanks!
>


-- 
Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>


Re: Testing Spark-Cassandra

2018-01-17 Thread Alonso Isidoro Roman
Yes, you can use docker to build your own cassandra ring. Depending on your
OS, instructions may change, so please follow this
<https://yurisubach.com/2016/03/24/cassandra-docker-test-cluster/> link to
install it, and then follow this
<https://github.com/koeninger/spark-cassandra-example> project, but you
will have to adapt the necessary libraries to use the spark 2.0.x version.

Good luck, I would like to see a blog post using this combination.
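
As a starting point, a minimal read sketch (it assumes a Cassandra node
reachable on 127.0.0.1:9042, e.g. the docker container from the link above,
the datastax spark-cassandra-connector on the classpath, and placeholder
keyspace/table names):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-docker-test")
  .master("local[*]")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .getOrCreate()

// read a table previously created with cqlsh inside the container
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test_ks", "table" -> "test_table"))
  .load()

df.show()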



2018-01-17 16:48 GMT+01:00 Guillermo Ortiz <konstt2...@gmail.com>:

> Hello,
>
> I'm using spark 2.0 and Cassandra. Is there any util to make unit test
> easily or which one would be the best way to do it? library? Cassandra with
> docker?
>



-- 
Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>


Re: Multiple Kafka topics processing in Spark 2.2

2017-09-06 Thread Alonso Isidoro Roman
Hi, reading the official doc
<http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html>,
i think you can do it this way:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka._

val directKafkaStream = KafkaUtils.createDirectStream[String,
  String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)

// Hold a reference to the current offset ranges, so it can be used downstream
var offsetRanges = Array.empty[OffsetRange]

directKafkaStream.transform { rdd =>
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.map {
  ...
}.foreachRDD { rdd =>
  for (o <- offsetRanges) {
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}


2017-09-06 14:38 GMT+02:00 Dan Dong <dongda...@gmail.com>:

> Hi, All,
>   I have one issue here about how to process multiple Kafka topics in a
> Spark 2.* program. My question is: How to get the topic name from a message
> received from Kafka? E.g:
>
> ..
> val messages = KafkaUtils.createDirectStream[String, String,
> StringDecoder, StringDecoder](
>   ssc, kafkaParams, topicsSet)
>
> // Get the lines, split them into words, count the words and print
> val lines = messages.map(_._2)
> val words = lines.flatMap(_.split(" "))
> val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
> wordCounts.print()
> ..
>
> Kafka send the messages in multiple topics through console producer for
> example. But when Spark receive the message, how it will know which topic
> is this piece of message coming from? Thanks a lot for any of your helps!
>
> Cheers,
> Dan
>



-- 
Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>


Re: Need some help around a Spark Error

2017-07-26 Thread Alonso Isidoro Roman
I hope that helps

https://stackoverflow.com/questions/40623957/slave-lost-and-very-slow-join-in-spark


Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2017-07-26 3:13 GMT+02:00 Debabrata Ghosh <mailford...@gmail.com>:

> Hi,
>   While executing a SparkSQL query, I am hitting the
> following error. Wonder, if you can please help me with a possible cause
> and resolution. Here is the stacktrace for the same:
>
> 07/25/2017 02:41:58 PM - DataPrep.py 323 - __main__ - ERROR - An error
> occurred while calling o49.sql.
>
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 0 in stage 12.2 failed 4 times, most recent failure: Lost task 0.3 in stage
> 12.2 (TID 2242, bicservices.hdp.com): ExecutorLostFailure (executor 25
> exited caused by one of the running tasks) Reason: Slave lost
>
> Driver stacktrace:
>
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$
> scheduler$DAGScheduler$$failJobAndIndependentStages(
> DAGScheduler.scala:1433)
>
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(
> DAGScheduler.scala:1421)
>
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(
> DAGScheduler.scala:1420)
>
> at scala.collection.mutable.ResizableArray$class.foreach(
> ResizableArray.scala:59)
>
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>
> at org.apache.spark.scheduler.DAGScheduler.abortStage(
> DAGScheduler.scala:1420)
>
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$
> handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
>
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$
> handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
>
> at scala.Option.foreach(Option.scala:236)
>
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(
> DAGScheduler.scala:801)
>
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.
> doOnReceive(DAGScheduler.scala:1642)
>
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.
> onReceive(DAGScheduler.scala:1601)
>
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.
> onReceive(DAGScheduler.scala:1590)
>
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:622)
>
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1856)
>
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1869)
>
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1946)
>
> at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.saveAsHiveFile(
> InsertIntoHiveTable.scala:84)
>
> at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.
> sideEffectResult$lzycompute(InsertIntoHiveTable.scala:201)
>
> at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.
> sideEffectResult(InsertIntoHiveTable.scala:127)
>
> at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.
> doExecute(InsertIntoHiveTable.scala:276)
>
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$
> execute$5.apply(SparkPlan.scala:132)
>
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$
> execute$5.apply(SparkPlan.scala:130)
>
> at org.apache.spark.rdd.RDDOperationScope$.withScope(
> RDDOperationScope.scala:150)
>
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>
> at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(
> QueryExecution.scala:55)
>
> at org.apache.spark.sql.execution.QueryExecution.
> toRdd(QueryExecution.scala:55)
>
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
>
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
>
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:62)
>
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:498)
>
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
>
> at py4j.Gateway.invoke(Gateway.java:259)
>
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>
> at py4j.GatewayConnection.run(GatewayConnection.java:209)
> at java.lang.Thread.run(Thread.java:745)
>
> Cheers,
>
> Debu
>


Re: how to identify the alive master spark via Zookeeper ?

2017-07-17 Thread Alonso Isidoro Roman
Not sure if this can help, but a quick search on stackoverflow returns this
<https://stackoverflow.com/questions/33504798/how-to-find-the-master-url-for-an-existing-spark-cluster>
and this one
<https://stackoverflow.com/questions/37420537/how-to-check-status-of-spark-applications-from-the-command-line>...
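
One possible workaround (an untested sketch: if I remember correctly, the
standalone master web UI also serves a JSON summary at /json that includes a
status field, so a small script, or the reverse proxy itself, can poll both
masters and pick the one reporting ALIVE; host names and port are
placeholders):

import scala.io.Source
import scala.util.Try

val masterUis = Seq("http://master1:8080", "http://master2:8080")

// a proper JSON parse would be more robust; this just looks for the ALIVE marker
val aliveMaster = masterUis.find { ui =>
  Try(Source.fromURL(s"$ui/json").mkString).toOption.exists(_.contains("ALIVE"))
}

println(aliveMaster.getOrElse("no ALIVE master found"))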



Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2017-07-17 14:21 GMT+02:00 <marina.bru...@orange.com>:

> Hello,
>
>
>
>
>
> In our project, we have a Spark cluster with 2 master and 4 workers and
> Zookeeper decides which master is alive.
>
> We have a problem with our reverse proxy to display the Spark Web UI. The
> RP redirect on a master with IP address configured in initial configuration
> but if Zookeeper decides to change the master, our spark Web UI is not
> accessible because the IP address of master changed.
>
> We want to find dynamically which master is elected every time.
>
> We search in Internet a solution to know with Zookeeper which master is
> alive but we don’t find anything. It is possible with confd to search if
> property changed but none property is saved in Zookeeper. In folder /spark
> in Zookeeper, nothing is logged.
>
> Why Spark does not send property to Zookeeper to indicate which ip address
> or hostname is elected ? In your class ZooKeeperLeaderElectionAgent.scala,
> you logged which master is elected but perhaps it will be also a good
> solution to send a property to Zookeeper to indicate host.
>
>
>
> We already asked to Zookeeper user mailing list and they said that:
>
> “This question may be better suited for the Spark mailing lists as
> Zookeeper doesn't really "decide" which master is alive but rather provides
> a mechanism for the application to make the correct decision.”
>
>
>
> So, we think that we are not alone with this type of problem but we can’t
> find anything on Internet.
>
>
>
> Can you help us to solve this problem ?
>
> Regards,
>
> Marina
>
> _
>
> Ce message et ses pieces jointes peuvent contenir des informations 
> confidentielles ou privilegiees et ne doivent donc
> pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu 
> ce message par erreur, veuillez le signaler
> a l'expediteur et le detruire ainsi que les pieces jointes. Les messages 
> electroniques etant susceptibles d'alteration,
> Orange decline toute responsabilite si ce message a ete altere, deforme ou 
> falsifie. Merci.
>
> This message and its attachments may contain confidential or privileged 
> information that may be protected by law;
> they should not be distributed, used or copied without authorisation.
> If you have received this email in error, please notify the sender and delete 
> this message and its attachments.
> As emails may be altered, Orange is not liable for messages that have been 
> modified, changed or falsified.
> Thank you.
>
>


Re: Need Spark(Scala) Performance Tuning tips

2017-06-09 Thread Alonso Isidoro Roman
a quick search on google:

https://www.cloudera.com/documentation/enterprise/5-9-x/topics/admin_spark_tuning.html

https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/

http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

and of course, Jacek's
<https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tuning.html>



Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2017-06-09 14:50 GMT+02:00 Debabrata Ghosh <mailford...@gmail.com>:

> Hi,
>  I need some help / guidance in performance tuning
> Spark code written in Scala. Can you please help.
>
> Thanks
>


Re: Java SPI jar reload in Spark

2017-06-06 Thread Alonso Isidoro Roman
Hi, a quick search on google.

https://github.com/spark-jobserver/spark-jobserver/issues/130

Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2017-06-06 12:14 GMT+02:00 Jonnas Li(Contractor) <
zhongshuang...@envisioncn.com>:

> Thank for your quick response.
> These jars are used to define some customize business logic, and they can
> be treat as plug-ins in our business scenario. And the jars are
> developed/maintain by some third-party partners, this means there will be
> some version updating.
> My expectation is update the business code with restarting the spark
> streaming job, any suggestion will be very grateful.
>
> Regards
> Jonnas Li
>
> From: Jörn Franke <jornfra...@gmail.com>
> Date: Tuesday, 6 June 2017, 5:55 PM
> To: Zhongshuang Li <zhongshuang...@envisioncn.com>
> Cc: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: Java SPI jar reload in Spark
>
> Why do you need jar reloading? What functionality is executed during jar
> reloading. Maybe there is another way to achieve the same without jar
> reloading. In fact, it might be dangerous from a functional point of view-
> functionality in jar changed and all your computation is wrong.
>
> On 6. Jun 2017, at 11:35, Jonnas Li(Contractor) <
> zhongshuang...@envisioncn.com> wrote:
>
> I have a Spark Streaming application, which dynamically calling a jar
> (Java SPI), and the jar is called in a mapWithState() function, it was
> working fine for a long time.
> Recently, I got a requirement which required to reload the jar during
> runtime.
> But when the reloading is completed, the spark streaming job got failed,
> and I get the following exception, it seems the spark try to deserialize
> the checkpoint failed.
> My question is whether the logic in the jar will be serialized into
> checkpoint, and is it possible to do the jar reloading during runtime in
> Spark Streaming?
>
>
> [2017-06-06 17:13:12,185] WARN Lost task 1.0 in stage 5355.0 (TID 4817, 
> ip-10-21-14-205.envisioncn.com): java.lang.ClassCastException: cannot assign 
> instance of scala.collection.immutable.List$SerializationProxy to field 
> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type 
> scala.collection.Seq in instance of 
> org.apache.spark.streaming.rdd.MapWithStateRDD
>   at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
>   at 
> java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:479)
>   at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStre

Re: Spark 2.1 - Infering schema of dataframe after reading json files not during

2017-06-02 Thread Alonso Isidoro Roman
Not sure if this can help you, but you can construct the schema
programmatically by providing a JSON schema file:

import org.apache.hadoop.fs.{FSDataInputStream, Path}
import org.apache.spark.sql.types.{DataType, StructType}

val path: Path = new Path(schema_parquet)
val fileSystem = path.getFileSystem(sc.hadoopConfiguration)

val inputStream: FSDataInputStream = fileSystem.open(path)

val schema_json = Stream.cons(inputStream.readLine(),
  Stream.continually(inputStream.readLine))

logger.debug("schema_json looks like " + schema_json.head)

val mySchemaStructType =
  DataType.fromJson(schema_json.head).asInstanceOf[StructType]

logger.debug("mySchemaStructType is " + mySchemaStructType)


where schema_parquet can be something like this:

{"type": "struct", "fields": [
  {"name": "column0", "type": "string", "nullable": false},
  {"name": "column1", "type": "string", "nullable": false},
  {"name": "column2", "type": "string", "nullable": true},
  {"name": "column3", "type": "string", "nullable": false}
]}



Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2017-06-02 16:11 GMT+02:00 Aseem Bansal <asmbans...@gmail.com>:

> When we read files in spark it infers the schema. We have the option to
> not infer the schema. Is there a way to ask spark to infer the schema again
> just like when reading json?
>
> The reason we want to get this done is because we have a problem in our
> data files. We have a json file containing this
>
> {"a": NESTED_JSON_VALUE}
> {"a":"null"}
>
> It should have been empty json but due to a bug it became "null" instead.
> Now, when we read the file the column "a" is considered as a String.
> Instead what we want to do is ask spark to read the file considering "a" as
> a String, filter the "null" out/replace with empty json and then ask spark
> to infer schema of "a" after the fix so we can access the nested json
> properly.
>


Re: Worker node log not showed

2017-05-31 Thread Alonso Isidoro Roman
Are you running the code with yarn?

If so, figure out the application ID through the web UI, then run the next
command:

yarn logs -applicationId <your_application_id>

Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2017-05-31 9:42 GMT+02:00 Paolo Patierno <ppatie...@live.com>:

> Hi all,
>
>
> I have a simple cluster with one master and one worker. On another machine
> I launch the driver where at some point I have following line of codes :
>
>
> max.foreachRDD(rdd -> {
>
> LOG.info("*** max.foreachRDD");
>
> rdd.foreach(value -> {
>
> LOG.info("*** rdd.foreach");
>
> });
> });
>
>
> The message "*** max.foreachRDD" is visible in the console of the driver
> machine ... and it's ok.
> I can't see the "*** rdd.foreach" message that should be executed on the
> worker node right ? Btw on the worker node console I can't see it. Why ?
>
> My need is to log what happens in the code executed on worker node
> (because it works if I execute it on master local[*] but not when submitted
> to a worker node).
>
> Following the log4j.properties file I put in the /conf dir :
>
> # Set everything to be logged to the console
> log4j.rootCategory=INFO, console
> log4j.appender.console=org.apache.log4j.ConsoleAppender
> log4j.appender.console.target=System.err
> log4j.appender.console.layout=org.apache.log4j.PatternLayout
> log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p 
> %c{1}: %m%n
>
> # Settings to quiet third party logs that are too verbose
> log4j.logger.org.spark-project.jetty=WARN
> log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
> log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
> log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
>
>
>
> Thanks
> Paolo.
>
>
> *Paolo Patierno*
>
> *Senior Software Engineer (IoT) @ Red Hat **Microsoft MVP on **Windows
> Embedded & IoT*
> *Microsoft Azure Advisor*
>
> Twitter : @ppatierno <http://twitter.com/ppatierno>
> Linkedin : paolopatierno <http://it.linkedin.com/in/paolopatierno>
> Blog : DevExperience <http://paolopatierno.wordpress.com/>
>


Re: Spark Streaming: Custom Receiver OOM consistently

2017-05-21 Thread Alonso Isidoro Roman
could you share the code?

Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2017-05-20 7:54 GMT+02:00 Manish Malhotra <manish.malhotra.w...@gmail.com>:

> Hello,
>
> have implemented Java based custom receiver, which consumes from
> messaging system say JMS.
> once received message, I call store(object) ... Im storing spark Row
> object.
>
> it run for around 8 hrs, and then goes OOM, and OOM is happening in
> receiver nodes.
> I also tried to run multiple receivers, to distribute the load but faces
> the same issue.
>
> something fundamentally we are doing wrong, which tells custom receiver/spark
> to release the memory.
> but Im not able to crack that, atleast till now.
>
> any help is appreciated !!
>
> Regards,
> Manish
>
>


Re: How to read large size files from a directory ?

2017-05-09 Thread Alonso Isidoro Roman
please create a github repo and upload the code there...


Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2017-05-09 8:47 GMT+02:00 ashwini anand <aayan...@gmail.com>:

> I am reading each file of a directory using wholeTextFiles. After that I
> am calling a function on each element of the rdd using map . The whole
> program uses just 50 lines of each file. The code is as below:
>
> def processFiles(fileNameContentsPair):
>     fileName = fileNameContentsPair[0]
>     result = "\n\n" + fileName
>     resultEr = "\n\n" + fileName
>     input = StringIO.StringIO(fileNameContentsPair[1])
>     reader = csv.reader(input, strict=True)
>     try:
>         i = 0
>         for row in reader:
>             if i == 50:
>                 break
>             # do some processing and get result string
>             i = i + 1
>     except csv.Error as e:
>         resultEr = resultEr + "error occured\n\n"
>         return resultEr
>     return result
>
> if __name__ == "__main__":
>     inputFile = sys.argv[1]
>     outputFile = sys.argv[2]
>     sc = SparkContext(appName="SomeApp")
>     resultRDD = sc.wholeTextFiles(inputFile).map(processFiles)
>     resultRDD.saveAsTextFile(outputFile)
> The size of each file of the directory can be very large in my case and
> because of this reason use of wholeTextFiles api will be inefficient in
> this case. Right now wholeTextFiles loads full file content into the
> memory. can we make wholeTextFiles to load only first 50 lines of each file
> ? Apart from using wholeTextFiles, other solution I can think of is
> iterating over each file of the directory one by one but that also seems to
> be inefficient. I am new to spark. Please let me know if there is any
> efficient way to do this.
> --
> View this message in context: How to read large size files from a
> directory ?
> <http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-large-size-files-from-a-directory-tp28669.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>


Re: Has anyone used CoreNLP from stanford for sentiment analysis in Spark? It does not work as desired for me.

2017-04-28 Thread Alonso Isidoro Roman
I forked some time ago a twitter analyzer, but i think the best is to
provide the original link
<https://github.com/vspiewak/twitter-sentiment-analysis>.

If you want you can take a look to my fork
<https://github.com/alonsoir/twitter-sentiment-analysis>.

regards


Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2017-04-28 10:05 GMT+02:00 Gaurav1809 <gauravhpan...@gmail.com>:

> Has anyone used CoreNLP from stanford for sentiment analysis in Spark?
> It is not working as desired or may be I need to do some work which I am
> not
> aware of.
>
> Following is the example.
>
> 1). I look forward to interacting with kids of states governed by the
> congress. - POSITIVE
> 2). I look forward to interacting with CM of states governed by the
> congress. - NEGETIVE (CM is chief minister)
>
> Please note the change in one word here. kids -> CM
> Statement 2 is not negetive but coreNLP tagged it as negetive.
> is there anything I need to do to make it work as desired? Any alteration
> required? Please let me know if I need to plug-in any custom code.
> Whoever has knowledge on this, please suggest something.
> Also, suggest if there is any other better alternate to coreNLP.
>
> Thanks.
> Gaurav
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Has-anyone-used-CoreNLP-from-stanford-for-sentiment-
> analysis-in-Spark-It-does-not-work-as-desired-fo-tp28634.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: 答复: 答复: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread Alonso Isidoro Roman
Forgive my ignorance, but what does HAR mean? Is it an acronym for high
availability record?

Thanks

Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2017-04-20 10:58 GMT+02:00 莫涛 <mo...@sensetime.com>:

> Hi Jörn,
>
>
> HAR is a great idea!
>
>
> For POC, I've archived 1M records and stored the id -> path mapping in
> text (for better readability).
>
> Filtering 1K records takes only 2 minutes now (30 seconds to get the path
> list and 0.5 second per thread to read a record).
>
> Such performance is exactly what I expected: "only the requested BINARY
> are scanned".
>
> Moreover, HAR provides directly access to each record by hdfs shell
> command.
>
>
> Thank you very much!
> --
> *From:* Jörn Franke <jornfra...@gmail.com>
> *Sent:* 17 April 2017 22:37:48
> *To:* 莫涛
> *Cc:* user@spark.apache.org
> *Subject:* Re: Re: How to store 10M records in HDFS to speed up further
> filtering?
>
> Yes 5 mb is a difficult size, too small for HDFS too big for parquet/orc.
> Maybe you can put the data in a HAR and store id, path in orc/parquet.
>
> On 17. Apr 2017, at 10:52, 莫涛 <mo...@sensetime.com> wrote:
>
> Hi Jörn,
>
>
> I do think a 5 MB column is odd but I don't have any other idea before
> asking this question. The binary data is a short video and the maximum
> size is no more than 50 MB.
>
>
> Hadoop archive sounds very interesting and I'll try it first to check
> whether filtering is fast on it.
>
>
> To my best knowledge, HBase works best for record around hundreds of KB
> and it requires extra work of the cluster administrator. So this would be
> the last option.
>
>
> Thanks!
>
>
> Mo Tao
> --
> *From:* Jörn Franke <jornfra...@gmail.com>
> *Sent:* 17 April 2017 15:59:28
> *To:* 莫涛
> *Cc:* user@spark.apache.org
> *Subject:* Re: How to store 10M records in HDFS to speed up further filtering?
>
> You need to sort the data by id otherwise q situation can occur where the
> index does not work. Aside from this, it sounds odd to put a 5 MB column
> using those formats. This will be also not so efficient.
> What is in the 5 MB binary data?
> You could use HAR or maybe Hbase to store this kind of data (if it does
> not get much larger than 5 MB).
>
> > On 17. Apr 2017, at 08:23, MoTao <mo...@sensetime.com> wrote:
> >
> > Hi all,
> >
> > I have 10M (ID, BINARY) record, and the size of each BINARY is 5MB on
> > average.
> > In my daily application, I need to filter out 10K BINARY according to an
> ID
> > list.
> > How should I store the whole data to make the filtering faster?
> >
> > I'm using DataFrame in Spark 2.0.0 and I've tried row-based format (avro)
> > and column-based format (orc).
> > However, both of them require to scan almost ALL records, making the
> > filtering stage very very slow.
> > The code block for filtering looks like:
> >
> > val IDSet: Set[String] = ...
> > val checkID = udf { ID: String => IDSet(ID) }
> > spark.read.orc("/path/to/whole/data")
> >  .filter(checkID($"ID"))
> >  .select($"ID", $"BINARY")
> >  .write...
> >
> > Thanks for any advice!
> >
> >
> >
> >
> > --
> > View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/How-to-store-10M-records-in-HDFS-to-
> speed-up-further-filtering-tp28605.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
>


Re: Any NLP library for sentiment analysis in Spark?

2017-04-12 Thread Alonso Isidoro Roman
I forked some time ago a project, maybe you can use it.

https://github.com/alonsoir/SparkTwitterAnalyzer



Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2017-04-12 8:45 GMT+02:00 Jayant Shekhar <jayantbaya...@gmail.com>:

> Hello Gaurav,
>
> Yes, Stanford CoreNLP is of course great to use too!
>
> You can find sample code here and pull the UDF's into your project :
> https://github.com/sparkflows/sparkflows-stanfordcorenlp
>
> Thanks,
> Jayant
>
>
> On Tue, Apr 11, 2017 at 8:44 PM, Gaurav Pandya <gauravhpan...@gmail.com>
> wrote:
>
>> Thanks guys.
>> How about Standford CoreNLP?
>> Any reviews/ feedback?
>> Please share the details if anyone has used it in past.
>>
>>
>> On Tue, Apr 11, 2017 at 11:46 PM, <ian.malo...@tdameritrade.com> wrote:
>>
>>> I think team used this awhile ago, but there was some tweak that needed
>>> to be made to get it to work.
>>>
>>> https://github.com/databricks/spark-corenlp
>>>
>>> From: Gabriel James <gabriel.ja...@heliase.com>
>>> Organization: Heliase Genomics
>>> Reply-To: "gabriel.ja...@heliase.com" <gabriel.ja...@heliase.com>
>>> Date: Tuesday, April 11, 2017 at 2:13 PM
>>> To: 'Kevin Wang' <buz...@gmail.com>, 'Alonso Isidoro Roman' <alons...@gmail.com>
>>> Cc: 'Gaurav1809' <gauravhpan...@gmail.com>, "user@spark.apache.org" <user@spark.apache.org>
>>> Subject: RE: Any NLP library for sentiment analysis in Spark?
>>>
>>> Me too. Experiences and recommendations please.
>>>
>>> Gabriel
>>>
>>> From: Kevin Wang [mailto:buz...@gmail.com]
>>> Sent: Wednesday, April 12, 2017 6:11 AM
>>> To: Alonso Isidoro Roman <alons...@gmail.com>
>>> Cc: Gaurav1809 <gauravhpan...@gmail.com>; user@spark.apache.org
>>> Subject: Re: Any NLP library for sentiment analysis in Spark?
>>>
>>> I am also interested in this topic.  Anything else anyone can
>>> recommend?  Thanks.
>>>
>>> Best,
>>>
>>> Kevin
>>>
>>> On Tue, Apr 11, 2017 at 5:00 AM, Alonso Isidoro Roman <alons...@gmail.com>
>>> wrote:
>>> i did not use it yet, but this library looks promising:
>>>
>>> https://github.com/databricks/spark-corenlp
>>>
>>>
>>> Alonso Isidoro Roman
>>> about.me/alonso.isidoro.roman
>>>
>>>
>>> 2017-04-11 11:02 GMT+02:00 Gaurav1809 <gauravhpan...@gmail.com>:
>>> Hi All,
>>>
>>> I need to determine sentiment for given document (statement, paragraph
>>> etc.)
>>> Is there any NLP library available with Apache Spark that I can use here?
>>>
>>> Any other pointers towards this would be highly appreciated.
>>>
>>> Thanks in advance.
>>> Gaurav Pandya
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.
>>> 1001560.n3.nabble.com/Any-NLP-library-for-sentiment-analysis-in-Spark-tp28586.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>>
>>
>


Re: Any NLP library for sentiment analysis in Spark?

2017-04-11 Thread Alonso Isidoro Roman
i did not use it yet, but this library looks promising:

https://github.com/databricks/spark-corenlp


Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2017-04-11 11:02 GMT+02:00 Gaurav1809 <gauravhpan...@gmail.com>:

> Hi All,
>
> I need to determine sentiment for given document (statement, paragraph
> etc.)
> Is there any NLP library available with Apache Spark that I can use here?
>
> Any other pointers towards this would be highly appreciated.
>
> Thanks in advance.
> Gaurav Pandya
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Any-NLP-library-for-sentiment-
> analysis-in-Spark-tp28586.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Benchmarking streaming frameworks

2017-04-03 Thread Alonso Isidoro Roman
I remember that yahoo did something similar...


https://github.com/yahoo/streaming-benchmarks


Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2017-04-03 9:41 GMT+02:00 gvdongen <giselle.vandon...@ugent.be>:

> Dear users of Streaming Technologies,
>
> As a PhD student in big data analytics, I am currently in the process of
> compiling a list of benchmarks (to test multiple streaming frameworks) in
> order to create an expanded benchmarking suite. The benchmark suite is
> being
> developed as a part of my current work at Ghent University.
>
> The included frameworks at this time are, in no particular order, Spark,
> Flink, Kafka (Streams), Storm (Trident) and Drizzle. Any pointers to
> previous work or relevant benchmarks would be appreciated.
>
> Best regards,
> Giselle van Dongen
>
> --
> View this message in context: Benchmarking streaming frameworks
> <http://apache-spark-user-list.1001560.n3.nabble.com/Benchmarking-streaming-frameworks-tp28557.html>
>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>


Re: How best we can store streaming data on dashboards for real time user experience?

2017-03-30 Thread Alonso Isidoro Roman
You can check this link if you want:
<https://github.com/alonsoir/twitter-sentiment-analysis>

Elasticsearch, Kibana and Spark working together.

Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2017-03-30 16:25 GMT+02:00 Miel Hostens <miel.host...@ugent.be>:

> We're doing exactly same thing over here!
>
> Spark + ELK
>
> Met vriendelijk groeten,
>
> *Miel Hostens*,
>
> *B**uitenpraktijk+*
> Department of Reproduction, Obstetrics and Herd Health
> <http://www.rohh.ugent.be/>
>
> Ambulatory Clinic
> Faculty of Veterinary Medicine <http://www.ugent.be/di>
> Ghent University  <http://www.ugent.be/>
> Salisburylaan 133
> 9820 Merelbeke
> Belgium
>
>
>
> Tel:   +32 9 264 75 28 <+32%209%20264%2075%2028>
>
> Fax:  +32 9 264 77 97 <+32%209%20264%2077%2097>
>
> Mob: +32 478 59 37 03 <+32%20478%2059%2037%2003>
>
>  e-mail: miel.host...@ugent.be
>
> On 30 March 2017 at 15:01, Szuromi Tamás <trom...@gmail.com> wrote:
>
>> For us, after some Spark Streaming transformation, Elasticsearch + Kibana
>> is a great combination to store and visualize data.
>> An alternative solution that we use is Spark Streaming put some data back
>> to Kafka and we consume it with nodejs.
>>
>> Cheers,
>> Tamas
>>
>> 2017-03-30 9:25 GMT+02:00 Alonso Isidoro Roman <alons...@gmail.com>:
>>
>>> Read this first:
>>>
>>> http://www.oreilly.com/data/free/big-data-analytics-emerging
>>> -architecture.csp
>>>
>>> https://www.ijircce.com/upload/2015/august/97_A%20Study.pdf
>>>
>>> http://www.pentaho.com/assets/pdf/CqPxTROXtCpfoLrUi4Bj.pdf
>>>
>>> http://www.gartner.com/smarterwithgartner/six-best-practices
>>> -for-real-time-analytics/
>>>
>>> https://speakerdeck.com/elasticsearch/using-elasticsearch-lo
>>> gstash-and-kibana-to-create-realtime-dashboards
>>>
>>> https://www.youtube.com/watch?v=PuvHINcU9DI
>>>
>>> then take a look to
>>>
>>> https://kudu.apache.org/
>>>
>>> Tell us later what you think.
>>>
>>>
>>>
>>>
>>> Alonso Isidoro Roman
>>> [image: https://]about.me/alonso.isidoro.roman
>>>
>>> <https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>
>>>
>>> 2017-03-30 7:14 GMT+02:00 Gaurav Pandya <gauravhpan...@gmail.com>:
>>>
>>>> Hi Noorul,
>>>>
>>>> Thanks for the reply.
>>>> But then how to build the dashboard report? Don't we need to store data
>>>> anywhere?
>>>> Please suggest.
>>>>
>>>> Thanks.
>>>> Gaurav
>>>>
>>>> On Thu, Mar 30, 2017 at 10:32 AM, Noorul Islam Kamal Malmiyoda <
>>>> noo...@noorul.com> wrote:
>>>>
>>>>> I think better place would be a in memory cache for real time.
>>>>>
>>>>> Regards,
>>>>> Noorul
>>>>>
>>>>> On Thu, Mar 30, 2017 at 10:31 AM, Gaurav1809 <gauravhpan...@gmail.com>
>>>>> wrote:
>>>>> > I am getting streaming data and want to show them onto dashboards in
>>>>> real
>>>>> > time?
>>>>> > May I know how best we can handle these streaming data? where to
>>>>> store? (DB
>>>>> > or HDFS or ???)
>>>>> > I want to give users a real time analytics experience.
>>>>> >
>>>>> > Please suggest possible ways. Thanks.
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > View this message in context: http://apache-spark-user-list.
>>>>> 1001560.n3.nabble.com/How-best-we-can-store-streaming-data-o
>>>>> n-dashboards-for-real-time-user-experience-tp28548.html
>>>>> > Sent from the Apache Spark User List mailing list archive at
>>>>> Nabble.com.
>>>>> >
>>>>> > 
>>>>> -
>>>>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>> >
>>>>>
>>>>
>>>>
>>>
>>
>


Re: How best we can store streaming data on dashboards for real time user experience?

2017-03-30 Thread Alonso Isidoro Roman
Read this first:

http://www.oreilly.com/data/free/big-data-analytics-emerging-architecture.csp

https://www.ijircce.com/upload/2015/august/97_A%20Study.pdf

http://www.pentaho.com/assets/pdf/CqPxTROXtCpfoLrUi4Bj.pdf

http://www.gartner.com/smarterwithgartner/six-best-practices-for-real-time-analytics/

https://speakerdeck.com/elasticsearch/using-elasticsearch-logstash-and-kibana-to-create-realtime-dashboards

https://www.youtube.com/watch?v=PuvHINcU9DI

then take a look to

https://kudu.apache.org/

Tell us later what you think.




Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2017-03-30 7:14 GMT+02:00 Gaurav Pandya <gauravhpan...@gmail.com>:

> Hi Noorul,
>
> Thanks for the reply.
> But then how to build the dashboard report? Don't we need to store data
> anywhere?
> Please suggest.
>
> Thanks.
> Gaurav
>
> On Thu, Mar 30, 2017 at 10:32 AM, Noorul Islam Kamal Malmiyoda <
> noo...@noorul.com> wrote:
>
>> I think better place would be a in memory cache for real time.
>>
>> Regards,
>> Noorul
>>
>> On Thu, Mar 30, 2017 at 10:31 AM, Gaurav1809 <gauravhpan...@gmail.com>
>> wrote:
>> > I am getting streaming data and want to show them onto dashboards in
>> real
>> > time?
>> > May I know how best we can handle these streaming data? where to store?
>> (DB
>> > or HDFS or ???)
>> > I want to give users a real time analytics experience.
>> >
>> > Please suggest possible ways. Thanks.
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/How-best-we-can-store-streaming-data-
>> on-dashboards-for-real-time-user-experience-tp28548.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > -
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >
>>
>
>


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Alonso Isidoro Roman
"Using Spark to query the data in the backend of the web UI?"

Don't do that. I would recommend that the Spark Streaming process store the
data into some NoSQL or SQL database, and that the web UI query the data
from that database.
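
A minimal sketch of that shape, using Cassandra purely as an example of such
a store (the keyspace/table names, the Event case class and the stream
variable are placeholders, and the datastax spark-cassandra-connector is
assumed to be on the classpath and configured):

import com.datastax.spark.connector._

case class Event(id: String, ts: Long, payload: String)

// stream is assumed to be a DStream[Event] of already-standardized records
stream.foreachRDD { rdd =>
  // each micro-batch is appended to the serving table that the web UI queries
  rdd.saveToCassandra("reports_ks", "events", SomeColumns("id", "ts", "payload"))
}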

Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-09-29 16:15 GMT+02:00 Ali Akhtar <ali.rac...@gmail.com>:

> The web UI is actually the speed layer, it needs to be able to query the
> data online, and show the results in real-time.
>
> It also needs a custom front-end, so a system like Tableau can't be used,
> it must have a custom backend + front-end.
>
> Thanks for the recommendation of Flume. Do you think this will work:
>
> - Spark Streaming to read data from Kafka
> - Storing the data on HDFS using Flume
> - Using Spark to query the data in the backend of the web UI?
>
>
>
> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> You need a batch layer and a speed layer. Data from Kafka can be stored
>> on HDFS using flume.
>>
>> -  Query this data to generate reports / analytics (There will be a web
>> UI which will be the front-end to the data, and will show the reports)
>>
>> This is basically batch layer and you need something like Tableau or
>> Zeppelin to query data
>>
>> You will also need spark streaming to query data online for speed layer.
>> That data could be stored in some transient fabric like ignite or even
>> druid.
>>
>> HTH
>>
>>
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 29 September 2016 at 15:01, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>
>>> It needs to be able to scale to a very large amount of data, yes.
>>>
>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <deepakmc...@gmail.com>
>>> wrote:
>>>
>>>> What is the message inflow ?
>>>> If it's really high , definitely spark will be of great use .
>>>>
>>>> Thanks
>>>> Deepak
>>>>
>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <ali.rac...@gmail.com> wrote:
>>>>
>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>
>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>>> raw data into Kafka.
>>>>>
>>>>> I need to:
>>>>>
>>>>> - Do ETL on the data, and standardize it.
>>>>>
>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS
>>>>> / ElasticSearch / Postgres)
>>>>>
>>>>> - Query this data to generate reports / analytics (There will be a web
>>>>> UI which will be the front-end to the data, and will show the reports)
>>>>>
>>>>> Java is being used as the backend language for everything (backend of
>>>>> the web UI, as well as the ETL layer)
>>>>>
>>>>> I'm considering:
>>>>>
>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>> (receive raw data from Kafka, standardize & store it)
>>>>>
>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>>> data, and to allow queries
>>>>>
>>>>> - In the backend of the web UI, I could either use Spark to run
>>>>> queries across the data (mostly filters), or directly run queries against
>>>>> Cassandra / HBase
>>>>>
>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>> alternatives I should go with (e.g, using raw Kafka consumers vs Spark for
>>>>> ETL, which persistent data store to use, and how to query that data store
>>>>> in the backend of the web UI, for displaying the reports).
>>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>
>>>
>>
>


Re: What is the difference between mini-batch vs real time streaming in practice (not theory)?

2016-09-27 Thread Alonso Isidoro Roman
Mini-batch or near real time: processing records within roughly 500 ms or more.

Real time: processing records within roughly 5-10 ms.

The main difference is processing latency, I think.

Apache Spark Streaming is mini batch, not true real time.

Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-09-27 11:15 GMT+02:00 kant kodali <kanth...@gmail.com>:

> I understand the difference between fraud detection and fraud prevention
> in general but I am not interested in the semantic war on what these terms
> precisely mean. I am more interested in understanding the difference
> between mini-batch vs real time streaming from CS perspective.
>
>
>
> On Tue, Sep 27, 2016 12:54 AM, Mich Talebzadeh mich.talebza...@gmail.com
> wrote:
>
>> Replace mini-batch with micro-batching and do a search again. what is
>> your understanding of fraud detection?
>>
>> Spark streaming can be used for risk calculation and fraud detection
>> (including stopping fraud going through for example credit card
>> fraud) effectively "in practice". it can even be used for Complex Event
>> Processing.
>>
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 27 September 2016 at 08:12, kant kodali <kanth...@gmail.com> wrote:
>>
>> What is the difference between mini-batch vs real time streaming in
>> practice (not theory)? In theory, I understand mini batch is something that
>> batches in the given time frame whereas real time streaming is more like do
>> something as the data arrives but my biggest question is why not have mini
>> batch with epsilon time frame (say one millisecond) or I would like to
>> understand reason why one would be an effective solution than other?
>> I recently came across one example where mini-batch (Apache Spark) is
>> used for Fraud detection and real time streaming (Apache Flink) used for
>> Fraud Prevention. Someone also commented saying mini-batches would not be
>> an effective solution for fraud prevention (since the goal is to prevent
>> the transaction from occurring as it happened) Now I wonder why this
>> wouldn't be so effective with mini batch (Spark) ? Why is it not effective
>> to run mini-batch with 1 millisecond latency? Batching is a technique used
>> everywhere including the OS and the Kernel TCP/IP stack where the data to
>> the disk or network are indeed buffered so what is the convincing factor
>> here to say one is more effective than other?
>> Thanks,
>> kant
>>
>>
>>


Re: Extract timestamp from Kafka message

2016-09-26 Thread Alonso Isidoro Roman
Hmm, I think you have to embed the timestamp within the message itself...
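For instance, the producer can put the event time into the payload itself and
the streaming job can parse it back out. A minimal Scala sketch (the
"<epochMillis>|<body>" format, topic name and broker address are made up for
illustration; the Java direct-stream code below would work the same way):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("timestamp-in-payload")
val ssc  = new StreamingContext(conf, Seconds(2))

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("someTopic"))

// The producer is assumed to write values shaped as "<epochMillis>|<body>".
val withTimestamps = stream.map { case (_, value) =>
  val Array(ts, body) = value.split("\\|", 2)
  (ts.toLong, body)
}
withTimestamps.print()

ssc.start()
ssc.awaitTermination()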

Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-09-26 0:59 GMT+02:00 Kevin Tran <kevin...@gmail.com>:

> Hi Everyone,
> Does anyone know how could we extract timestamp from Kafka message in
> Spark streaming ?
>
> JavaPairInputDStream<String, String> messagesDStream =
> KafkaUtils.createDirectStream(
>ssc,
>String.class,
>String.class,
>StringDecoder.class,
>StringDecoder.class,
>kafkaParams,
>topics
>);
>
>
> Thanks,
> Kevin.
>
>
>
>
>
>
>


Re: Small files

2016-09-12 Thread Alonso Isidoro Roman
Hi Ayan,

"My problem is to get data on to HDFS for the first time."

Well, you have to put them on the cluster. With this simple command you can
load them into HDFS:

hdfs dfs -put $LOCAL_SRC_DIR $HDFS_PATH

Then, I think you have to use coalesce in order to create a single big
consolidated file :) but I have never had to do it myself yet, so maybe I am
wrong.
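Something along these lines should do it, assuming the small files are already
reachable from the cluster (a sketch with made-up paths and partition count,
and assuming sc is the usual SparkContext):

// Read a directory full of small files and rewrite it as a handful of larger ones.
val raw = sc.textFile("hdfs:///user/ayan/small-files/*")

// coalesce() avoids a full shuffle; pick the partition count from the target file size.
raw.coalesce(16).saveAsTextFile("hdfs:///user/ayan/consolidated")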

Please take a look at these posts and let us know how you deal with it.

https://stuartsierra.com/2008/04/24/a-million-little-files


http://blog.cloudera.com/blog/2009/02/the-small-files-problem/


"One way I can think of is to load small files on each cluster machines on
the same folder. For example load file 1-0.3 mil on machine 1, 0.3-0.6 mil
on machine 2 and so on. Then I can run spark jobs which will locally read
files. "


Well, Hadoop does not work that way: when you load data into a Hadoop
cluster, it is distributed across every machine belonging to the cluster,
and the files are split into blocks spread between machines. I think what
you are describing is data locality, isn't it?

"Any better solution? Can flume help here?"

Of course Flume can do the job, but you will still have the small-files
problem anyway. You have to create a consolidated file before you upload it
to HDFS.

Regards




Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-09-12 14:11 GMT+02:00 ayan guha <guha.a...@gmail.com>:

> Hi
>
> Thanks for your mail. I have read few of those posts. But always I see
> solutions assume data is on hdfs already. My problem is to get data on to
> HDFS for the first time.
>
> One way I can think of is to load small files on each cluster machines on
> the same folder. For example load file 1-0.3 mil on machine 1, 0.3-0.6 mil
> on machine 2 and so on. Then I can run spark jobs which will locally read
> files.
>
> Any better solution? Can flume help here?
>
> Any idea is appreciated.
>
> Best
> Ayan
> On 12 Sep 2016 20:54, "Alonso Isidoro Roman" <alons...@gmail.com> wrote:
>
>> That is a good question Ayan. A few searches on so returns me:
>>
>> http://stackoverflow.com/questions/31009834/merge-multiple-
>> small-files-in-to-few-larger-files-in-spark
>>
>> http://stackoverflow.com/questions/29025147/how-can-i-merge-
>> spark-results-files-without-repartition-and-copymerge
>>
>>
>> good luck, tell us something about this issue
>>
>> Alonso
>>
>>
>> Alonso Isidoro Roman
>> [image: https://]about.me/alonso.isidoro.roman
>>
>> <https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>
>>
>> 2016-09-12 12:39 GMT+02:00 ayan guha <guha.a...@gmail.com>:
>>
>>> Hi
>>>
>>> I have a general question: I have 1.6 mil small files, about 200G all
>>> put together. I want to put them on hdfs for spark processing.
>>> I know sequence file is the way to go because putting small files on
>>> hdfs is not correct practice. Also, I can write a code to consolidate small
>>> files to seq files locally.
>>> My question: is there any way to do this in parallel, for example using
>>> spark or mr or anything else.
>>>
>>> Thanks
>>> Ayan
>>>
>>
>>


Re: Small files

2016-09-12 Thread Alonso Isidoro Roman
That is a good question, Ayan. A few searches on SO return:

http://stackoverflow.com/questions/31009834/merge-multiple-small-files-in-to-few-larger-files-in-spark

http://stackoverflow.com/questions/29025147/how-can-i-merge-spark-results-files-without-repartition-and-copymerge


Good luck; tell us how it goes with this issue.

Alonso


Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-09-12 12:39 GMT+02:00 ayan guha <guha.a...@gmail.com>:

> Hi
>
> I have a general question: I have 1.6 mil small files, about 200G all put
> together. I want to put them on hdfs for spark processing.
> I know sequence file is the way to go because putting small files on hdfs
> is not correct practice. Also, I can write a code to consolidate small
> files to seq files locally.
> My question: is there any way to do this in parallel, for example using
> spark or mr or anything else.
>
> Thanks
> Ayan
>


Re: Real Time Recommendation Engines with Spark and Scala

2016-09-05 Thread Alonso Isidoro Roman
By the way, I would love to work on your project, it looks promising!



Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-09-05 16:57 GMT+02:00 Alonso Isidoro Roman <alons...@gmail.com>:

> Hi Mitch,
>
>
>1. What do you mean "my own rating" here? You know the products. So
>what Amazon is going to provide by way of Kafka?
>
> The idea was to embed the functionality of a kafka producer within a rest
> service in order i can invoke this logic with my a rating. I did not create
> such functionality because i started to make another things, i get bored,
> basically. I created some unix commands with this code, using sbt-pack.
>
>
>
>1. Assuming that you have created topic specifically for this purpose
>then that topic is streamed into Kafka, some algorithms is applied and
>results are saved in DB
>
> you got it!
>
>1. You have some dashboard that will fetch data (via ???) from the DB
>and I guess Zeppelin can do it here?
>
>
> No, not any dashboard yet. Maybe it is not a good idea to connect mongodb
> with a dashboard through a web socket. It probably works, for a proof of
> concept, but, in a real project? i don't know yet...
>
> You can see what i did to push data within a kafka topic in this scala
> class
> <https://github.com/alonsoir/awesome-recommendation-engine/blob/master/src/main/scala/example/producer/AmazonProducerExample.scala>,
> you have to invoke pack within the scala shell to create this unix command.
>
> Regards!
>
> Alonso
>
>
>
> Alonso Isidoro Roman
> [image: https://]about.me/alonso.isidoro.roman
>
> <https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>
>
> 2016-09-05 16:39 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:
>
>> Thank you Alonso,
>>
>> I looked at your project. Interesting
>>
>> As I see it this is what you are suggesting
>>
>>
>>1. A kafka producer is going to ask periodically to Amazon in order
>>to know what products based on my own ratings and i am going to introduced
>>them into some kafka topic.
>>2. A spark streaming process is going to read from that previous
>>topic.
>>3. Apply some machine learning algorithms (ALS, content based
>>filtering colaborative filtering) on those datasets readed by the spark
>>streaming process.
>>4. Save results in a mongo or cassandra instance.
>>5. Use play framework to create an websocket interface between the
>>mongo instance and the visual interface.
>>
>>
>> As I understand
>>
>> Point 1: A kafka producer is going to ask periodically to Amazon in order
>> to know what products based on my own ratings .
>>
>>
>>1. What do you mean "my own rating" here? You know the products. So
>>what Amazon is going to provide by way of Kafka?
>>2. Assuming that you have created topic specifically for this purpose
>>then that topic is streamed into Kafka, some algorithms is applied and
>>results are saved in DB
>>3. You have some dashboard that will fetch data (via ???) from the DB
>>and I guess Zeppelin can do it here?
>>
>>
>> Do you Have a DFD diagram for your design in case. Something like below
>> (hope does not look pedantic, non intended).
>>
>> [image: Inline images 1]
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 5 September 2016 at 15:08, Alonso Isidoro Roman <alons...@gmail.com>
>> wrote:
>>
>>> Hi Mitch, i wrote few months ago a tiny project with this issue in mind.
>>> The idea is to apply ALS algorithm in order to get some valid
>>> recommendations from another users.
>>>
>>>
>>> The url of

Re: Real Time Recommendation Engines with Spark and Scala

2016-09-05 Thread Alonso Isidoro Roman
Hi Mitch,


   1. What do you mean "my own rating" here? You know the products. So what
   Amazon is going to provide by way of Kafka?

The idea was to embed the functionality of a Kafka producer within a REST
service so that I can invoke this logic with my own rating. I did not build
that functionality because I started working on other things and got bored,
basically. I created some unix commands with this code, using sbt-pack.
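The producer part boils down to something like this (a sketch using the plain
Kafka producer API; the broker address and topic name are the ones used
elsewhere in this thread, and the JSON shape is only illustrative):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "192.168.1.35:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// A rating serialized as JSON; a REST endpoint would build this from the request body.
val rating = """{"userId":"someUser","productId":"someProduct","rating":5.0}"""
producer.send(new ProducerRecord[String, String]("amazonRatingsTopic", rating))
producer.close()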



   1. Assuming that you have created topic specifically for this purpose
   then that topic is streamed into Kafka, some algorithms is applied and
   results are saved in DB

You got it!

   1. You have some dashboard that will fetch data (via ???) from the DB
   and I guess Zeppelin can do it here?


No, there is no dashboard yet. Maybe it is not a good idea to connect MongoDB
to a dashboard through a web socket. It probably works for a proof of concept,
but in a real project? I don't know yet...

You can see what I did to push data into a Kafka topic in this Scala class
<https://github.com/alonsoir/awesome-recommendation-engine/blob/master/src/main/scala/example/producer/AmazonProducerExample.scala>;
you have to invoke pack within the sbt shell to create the unix command.

Regards!

Alonso



Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-09-05 16:39 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:

> Thank you Alonso,
>
> I looked at your project. Interesting
>
> As I see it this is what you are suggesting
>
>
>1. A kafka producer is going to ask periodically to Amazon in order to
>know what products based on my own ratings and i am going to introduced
>them into some kafka topic.
>2. A spark streaming process is going to read from that previous topic.
>3. Apply some machine learning algorithms (ALS, content based
>filtering colaborative filtering) on those datasets readed by the spark
>streaming process.
>4. Save results in a mongo or cassandra instance.
>5. Use play framework to create an websocket interface between the
>mongo instance and the visual interface.
>
>
> As I understand
>
> Point 1: A kafka producer is going to ask periodically to Amazon in order
> to know what products based on my own ratings .
>
>
>1. What do you mean "my own rating" here? You know the products. So
>what Amazon is going to provide by way of Kafka?
>2. Assuming that you have created topic specifically for this purpose
>then that topic is streamed into Kafka, some algorithms is applied and
>results are saved in DB
>3. You have some dashboard that will fetch data (via ???) from the DB
>and I guess Zeppelin can do it here?
>
>
> Do you Have a DFD diagram for your design in case. Something like below
> (hope does not look pedantic, non intended).
>
> [image: Inline images 1]
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 5 September 2016 at 15:08, Alonso Isidoro Roman <alons...@gmail.com>
> wrote:
>
>> Hi Mitch, i wrote few months ago a tiny project with this issue in mind.
>> The idea is to apply ALS algorithm in order to get some valid
>> recommendations from another users.
>>
>>
>> The url of the project
>> <https://github.com/alonsoir/awesome-recommendation-engine>
>>
>>
>>
>> Alonso Isidoro Roman
>> [image: https://]about.me/alonso.isidoro.roman
>>
>> <https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>
>>
>> 2016-09-05 15:41 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:
>>
>>> Hi,
>>>
>>> Has anyone done any work on Real time recommendation engines with Spark
>>> and Scala.
>>>
>>> I have seen few PPTs with Python but wanted to see if these have been
>>> done with Scala.
>>>
>>> I trust this question makes sense.
>>>
>>> Thanks
>>>
>>> p.s. My prime inter

Re: Real Time Recommendation Engines with Spark and Scala

2016-09-05 Thread Alonso Isidoro Roman
Hi Mitch, I wrote a tiny project a few months ago with this issue in mind.
The idea is to apply the ALS algorithm in order to get some valid
recommendations from other users.


The url of the project
<https://github.com/alonsoir/awesome-recommendation-engine>
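At its core the training step is just MLlib's ALS; a minimal sketch (the
hyperparameters are illustrative, sc is an existing SparkContext, and hashing
the string ids is only a quick way to get the Int ids ALS expects):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// ratings.csv lines look like "userId,productId,score".
val ratings = sc.textFile("hdfs:///user/cloudera/ratings.csv").map { line =>
  val Array(user, product, score) = line.split(",")
  Rating(user.hashCode, product.hashCode, score.toDouble)
}

// rank = 10, iterations = 10, lambda = 0.01: example values, not tuned ones.
val model = ALS.train(ratings, 10, 10, 0.01)

// Top-10 product recommendations for one (hashed) user id.
val recommendations = model.recommendProducts("someUser".hashCode, 10)
recommendations.foreach(println)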



Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-09-05 15:41 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:

> Hi,
>
> Has anyone done any work on Real time recommendation engines with Spark
> and Scala.
>
> I have seen few PPTs with Python but wanted to see if these have been done
> with Scala.
>
> I trust this question makes sense.
>
> Thanks
>
> p.s. My prime interest would be in Financial markets.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Design patterns involving Spark

2016-08-30 Thread Alonso Isidoro Roman
Thanks Mitch, I will check it out.

Cheers


Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-08-30 9:52 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:

> You can use Hbase for building real time dashboards
>
> Check this link
> <https://www.sigmoid.com/integrating-spark-kafka-hbase-to-power-a-real-time-dashboard/>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 30 August 2016 at 08:33, Alonso Isidoro Roman <alons...@gmail.com>
> wrote:
>
>> HBase for real time queries? HBase was designed with the batch in mind.
>> Impala should be a best choice, but i do not know what Druid can do
>>
>>
>> Cheers
>>
>> Alonso Isidoro Roman
>> [image: https://]about.me/alonso.isidoro.roman
>>
>> <https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>
>>
>> 2016-08-30 8:56 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:
>>
>>> Hi Chanh,
>>>
>>> Druid sounds like a good choice.
>>>
>>> But again the point being is that what else Druid brings on top of
>>> Hbase.
>>>
>>> Unless one decides to use Druid for both historical data and real time
>>> data in place of Hbase!
>>>
>>> It is easier to write API against Druid that Hbase? You still want a UI
>>> dashboard?
>>>
>>> Cheers
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 30 August 2016 at 03:19, Chanh Le <giaosu...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> Seems a lot people using Druid for realtime Dashboard.
>>>> I’m just wondering of using Druid for main storage engine because Druid
>>>> can store the raw data and can integrate with Spark also (theoretical).
>>>> In that case do we need to store 2 separate storage Druid (store
>>>> segment in HDFS) and HDFS?.
>>>> BTW did anyone try this one https://github.com/Sparkli
>>>> neData/spark-druid-olap?
>>>>
>>>>
>>>> Regards,
>>>> Chanh
>>>>
>>>>
>>>> On Aug 30, 2016, at 3:23 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>> Thanks Bhaarat and everyone.
>>>>
>>>> This is an updated version of the same diagram
>>>>
>>>> 
>>>> ​​​
>>>> The frequency of Recent data is defined by the Windows length in Spark
>>>> Streaming. It can vary between 0.5 seconds to an hour. ( Don't think we can
>>>> move any Spark granularity below 0.5 seconds in anger. For some
>>>> applications like Credit card transactions and fraud detection. Data is
>>>> stored real time by Spark in Hbase tables. Hbase tables will be on HDFS as
>>>> well. The same Spark Streaming will write asynchronously to HDFS Hive
>>>> tables.
>>>> One school of thought is never write to Hive from Spark, write
>>>>  straight to Hbase and then read Hbase tables into Hive periodically?
>>>>
>>>> Now the thir

Re: Design patterns involving Spark

2016-08-30 Thread Alonso Isidoro Roman
HBase for real-time queries? HBase was designed with batch in mind. Impala
might be a better choice, but I do not know what Druid can do.


Cheers

Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-08-30 8:56 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:

> Hi Chanh,
>
> Druid sounds like a good choice.
>
> But again the point being is that what else Druid brings on top of Hbase.
>
> Unless one decides to use Druid for both historical data and real time
> data in place of Hbase!
>
> It is easier to write API against Druid that Hbase? You still want a UI
> dashboard?
>
> Cheers
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 30 August 2016 at 03:19, Chanh Le <giaosu...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> Seems a lot people using Druid for realtime Dashboard.
>> I’m just wondering of using Druid for main storage engine because Druid
>> can store the raw data and can integrate with Spark also (theoretical).
>> In that case do we need to store 2 separate storage Druid (store segment
>> in HDFS) and HDFS?.
>> BTW did anyone try this one https://github.com/Sparkli
>> neData/spark-druid-olap?
>>
>>
>> Regards,
>> Chanh
>>
>>
>> On Aug 30, 2016, at 3:23 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> Thanks Bhaarat and everyone.
>>
>> This is an updated version of the same diagram
>>
>> 
>> ​​​
>> The frequency of Recent data is defined by the Windows length in Spark
>> Streaming. It can vary between 0.5 seconds to an hour. ( Don't think we can
>> move any Spark granularity below 0.5 seconds in anger. For some
>> applications like Credit card transactions and fraud detection. Data is
>> stored real time by Spark in Hbase tables. Hbase tables will be on HDFS as
>> well. The same Spark Streaming will write asynchronously to HDFS Hive
>> tables.
>> One school of thought is never write to Hive from Spark, write  straight
>> to Hbase and then read Hbase tables into Hive periodically?
>>
>> Now the third component in this layer is Serving Layer that can combine
>> data from the current (Hbase) and the historical (Hive tables) to give the
>> user visual analytics. Now that visual analytics can be Real time dashboard
>> on top of Serving Layer. That Serving layer could be an in-memory NoSQL
>> offering or Data from Hbase (Red Box) combined with Hive tables.
>>
>> I am not aware of any industrial strength Real time Dashboard.  The idea
>> is that one uses such dashboard in real time. Dashboard in this sense
>> meaning a general purpose API to data store of some type like on Serving
>> layer to provide visual analytics real time on demand, combining real time
>> data and aggregate views. As usual the devil in the detail.
>>
>>
>>
>> Let me know your thoughts. Anyway this is first cut pattern.
>>
>> ​​
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 29 August 2016 at 18:53, Bhaarat Sharma <bhaara...@gmail.com> wrote:
>>
>>> Hi Mich
>>>
>>> This is really helpful. I'm trying to wrap my head around the last
>>> diagram you shared (the one with kafka). In this diagram spark streaming is
>>> pushing data to HDFS an

Re: Can I control the execution of Spark jobs?

2016-06-16 Thread Alonso Isidoro Roman
Hi Wang,

Maybe you can consider using an integration framework like Apache Camel in
order to run the different jobs; see the sketch below...
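A toy Camel route that runs two Spark jobs one after the other could look
roughly like this (a sketch, assuming camel-core and camel-exec are on the
classpath; the class names and jar paths are placeholders):

import org.apache.camel.builder.RouteBuilder
import org.apache.camel.impl.DefaultCamelContext

object EtlOrchestration {
  def main(args: Array[String]): Unit = {
    val context = new DefaultCamelContext()
    context.addRoutes(new RouteBuilder {
      override def configure(): Unit = {
        // Fire once, then shell out to spark-submit twice in sequence,
        // which gives a simple ordering dependency between the two jobs.
        from("timer:run-etl?repeatCount=1")
          .to("exec:spark-submit?args=--class com.example.JobOne /path/to/job-one.jar")
          .to("exec:spark-submit?args=--class com.example.JobTwo /path/to/job-two.jar")
      }
    })
    context.start()
    Thread.sleep(60000) // keep the JVM alive long enough for the toy example
    context.stop()
  }
}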

Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-06-16 13:08 GMT+02:00 Jacek Laskowski <ja...@japila.pl>:

> Hi,
>
> When you say "several ETL types of things", what is this exactly? What
> would an example of "dependency between these jobs" be?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Thu, Jun 16, 2016 at 11:36 AM, Haopu Wang <hw...@qilinsoft.com> wrote:
> > Hi,
> >
> >
> >
> > Suppose I have a spark application which is doing several ETL types of
> > things.
> >
> > I understand Spark can analyze and generate several jobs to execute.
> >
> > The question is: is it possible to control the dependency between these
> > jobs?
> >
> >
> >
> > Thanks!
> >
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Should I avoid "state" in an Spark application?

2016-06-13 Thread Alonso Isidoro Roman
Hi Haopu, please check these threads:

http://stackoverflow.com/questions/24331815/spark-streaming-historical-state

https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/total.html
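If the goal is to avoid a mutable var in the driver, the usual pattern is to
keep the state inside the DStream itself, for example with updateStateByKey.
A minimal sketch (word counts rather than your exact use case; the socket
source and checkpoint directory are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("stateful-count")
val ssc  = new StreamingContext(conf, Seconds(2))
ssc.checkpoint("/tmp/stateful-count-checkpoint") // required by updateStateByKey

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")).map((_, 1))

// The running count per word lives in Spark-managed state, not in a driver variable.
val runningCounts = words.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
  Some(state.getOrElse(0) + newValues.sum)
}
runningCounts.print()

ssc.start()
ssc.awaitTermination()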

Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-06-13 3:11 GMT+02:00 Haopu Wang <hw...@qilinsoft.com>:

> Can someone look at my questions? Thanks again!
>
>
> --
>
> *From:* Haopu Wang
> *Sent:* 2016年6月12日 16:40
> *To:* user@spark.apache.org
> *Subject:* Should I avoid "state" in an Spark application?
>
>
>
> I have a Spark application whose structure is below:
>
>
>
> var ts: Long = 0L
>
> dstream1.foreachRDD{
>
> (x, time) => {
>
> ts = time
>
> x.do_something()...
>
> }
>
> }
>
> ..
>
> process_data(dstream2, ts, ..)
>
>
>
> I assume foreachRDD function call can update "ts" variable which is then
> used in the Spark tasks of "process_data" function.
>
>
>
> From my test result of a standalone Spark cluster, it is working. But
> should I concern if switch to YARN?
>
>
>
> And I saw some articles are recommending to avoid state in Scala
> programming. Without the state variable, how could that be done?
>
>
>
> Any comments or suggestions are appreciated.
>
>
>
> Thanks,
>
> Haopu
>


Re: Spark Getting data from MongoDB in JAVA

2016-06-10 Thread Alonso Isidoro Roman
Why is the *spark-mongodb_2.11* dependency declared twice in pom.xml?

Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-06-10 11:39 GMT+02:00 Asfandyar Ashraf Malik <asfand...@kreditech.com>:

> Hi,
> I am using Stratio library to get MongoDB to work with Spark but I get the
> following error:
>
> java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.ScalaReflection
>
> This is my code.
>
> ---
> *public static void main(String[] args) {*
>
> *JavaSparkContext sc = new JavaSparkContext("local[*]", "test
> spark-mongodb java"); *
> *SQLContext sqlContext = new SQLContext(sc); *
>
> *Map options = new HashMap(); *
> *options.put("host", "xyz.mongolab.com:59107
> <http://xyz.mongolab.com:59107>"); *
> *options.put("database", "heroku_app3525385");*
> *options.put("collection", "datalog");*
> *options.put("credentials", "*,,");*
>
> *DataFrame df =
> sqlContext.read().format("com.stratio.datasource.mongodb").options(options).load();*
> *df.registerTempTable("datalog"); *
> *df.show();*
>
> *}*
>
> ---
> My pom file is as follows:
>
>  **
> **
> *org.apache.spark*
> *spark-core_2.11*
> *${spark.version}*
> **
> **
> *org.apache.spark*
> *spark-catalyst_2.11 *
> *${spark.version}*
> **
> **
> *org.apache.spark*
> *spark-sql_2.11*
> *${spark.version}*
> * *
> **
> *com.stratio.datasource*
> *spark-mongodb_2.11*
> *0.10.3*
> **
> **
> *com.stratio.datasource*
> *spark-mongodb_2.11*
> *0.10.3*
> *jar*
> **
> **
>
>
> Regards
>


Re: About a problem running a spark job in a cdh-5.7.0 vmware image.

2016-06-06 Thread Alonso Isidoro Roman
Hi, just to update the thread: I have just submitted a simple wordcount job
to YARN using this command:

[cloudera@quickstart simple-word-count]$ spark-submit --class
com.example.Hello --master yarn --deploy-mode cluster --driver-memory
1024Mb --executor-memory 1G --executor-cores 1
target/scala-2.10/test_2.10-1.0.jar

and the process was submitted to the cluster and finished fine; I can see
the correct output. Now I know that the previous process did not have enough
resources, so it is just a matter of tuning the process...
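For reference, a word count as simple as the one submitted above could look
like this (a sketch; the actual com.example.Hello class is not included in
this thread, and the input path is only an example):

import org.apache.spark.{SparkConf, SparkContext}

object Hello {
  def main(args: Array[String]): Unit = {
    // The master is supplied by spark-submit (--master yarn), not hard-coded here.
    val sc = new SparkContext(new SparkConf().setAppName("simple-word-count"))

    val counts = sc.textFile("hdfs:///user/cloudera/ratings.csv")
      .flatMap(_.split(","))
      .map((_, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}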

Running the free command outputs this:


[cloudera@quickstart simple-word-count]$ free
             total       used       free     shared    buffers     cached
Mem:       8061104    6687044    1374060       3464       5796     484416
-/+ buffers/cache:    6196832    1864272
Swap:      8388604     687500    7701104

so I can only use about 1 GB...


Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-06-06 12:03 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:

> have you tried master local that should work. This works as a test
>
> ${SPARK_HOME}/bin/spark-submit \
>  --driver-memory 2G \
> --num-executors 1 \
> --executor-memory 2G \
> --master local[2] \
> --executor-cores 2 \
> --conf "spark.scheduler.mode=FAIR" \
> --conf
> "spark.executor.extraJavaOptions=-XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps" \
> --jars
> /home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar \
> --class
> "com.databricks.apps.twitter_classifier.${FILE_NAME}" \
> --conf "spark.ui.port=${SP}" \
> --conf "spark.kryoserializer.buffer.max=512" \
> ${JAR_FILE} \
> ${OUTPUT_DIRECTORY:-/tmp/tweets} \
> ${NUM_TWEETS_TO_COLLECT:-1} \
> ${OUTPUT_FILE_INTERVAL_IN_SECS:-10} \
> ${OUTPUT_FILE_PARTITIONS_EACH_INTERVAL:-1} \
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 6 June 2016 at 10:28, Alonso Isidoro Roman <alons...@gmail.com> wrote:
>
>> Hi guys, i finally understand that i cannot use sbt-pack to use
>> programmatically  the spark-streaming job as unix commands, i have to use
>> yarn or mesos  in order to run the jobs.
>>
>> I have some doubts, if i run the spark streaming jogs as yarn client
>> mode, i am receiving this exception:
>>
>> [cloudera@quickstart ~]$ spark-submit --class
>> example.spark.AmazonKafkaConnectorWithMongo --master yarn --deploy-mode
>> client --driver-memory 4g --executor-memory 2g --executor-cores 3
>> /home/cloudera/awesome-recommendation-engine/target/scala-2.10/my-recommendation-spark-engine_2.10-1.0-SNAPSHOT.jar
>> 192.168.1.35:9092 amazonRatingsTopic
>> java.lang.ClassNotFoundException:
>> example.spark.AmazonKafkaConnectorWithMongo
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:270)
>> at org.apache.spark.util.Utils$.classForName(Utils.scala:175)
>> at
>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>>
>> But, if i use cluster mode, i have that is job is accepted.
>>
>> [cloudera@quickstart ~]$ spark-submit --class
>> example.spark.AmazonKafkaConnectorWithMongo --master yarn --deploy-mode
>> cluster --driver-memory 4g --executor-memory 2g --executor-cores 2
>> /home/cloudera/awesome-recommendation-engine/target/scala-2.10/my-recommendation-spark-engine_2.10-1.0-SNAPSHOT.jar
>> 192.168.1.35:9092 amazonRat

Re: About a problem running a spark job in a cdh-5.7.0 vmware image.

2016-06-06 Thread Alonso Isidoro Roman
rn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.cloudera
start time: 1465204607993
final status: UNDEFINED
tracking URL:
http://quickstart.cloudera:8088/proxy/application_1465201086091_0006/
user: cloudera
16/06/06 11:16:50 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:16:51 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:16:52 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:16:53 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:16:54 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:16:55 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:16:56 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:16:57 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:16:58 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:16:59 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:17:00 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:17:01 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:17:02 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:17:03 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:17:04 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:17:05 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:17:06 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:17:07 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:17:08 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:17:09 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:17:10 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:17:11 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:17:12 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:17:13 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:17:14 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:17:15 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:17:16 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:17:17 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:17:18 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
16/06/06 11:17:19 INFO yarn.Client: Application report for
application_1465201086091_0006 (state: ACCEPTED)
...

If I try to push a product to the Kafka topic (amazonRatingsTopic), whose
broker lives on my host machine (192.168.1.35:9092), I cannot see anything
in the logs. I can see in http://quickstart.cloudera:/jobbrowser/ that the
job is accepted, and when I click on the application_id, I see this:

The application might not be running yet or there is no Node Manager or
Container available. This page will be automatically refreshed.

even if I push data into the Kafka topic. Another thing I have noticed is
that the spark-worker dies a few minutes after the job is accepted, and I
have to restart it manually with sudo service spark-worker restart.

If I run the jps command, I see this:

[cloudera@quickstart ~]$ jps
11904 SparkSubmit
12890 Jps
7271 sbt-launch.jar
[cloudera@quickstart ~]$

I know that sbt-launch.jar is the sbt command running in another terminal,
but shouldn't the NameNode and DataNode processes also appear?

Thank you very much for reading this far.


Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-06-04 18:23 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:

> Hi,
>
> Spark works in local, standalone and yarn-client mode. Start as master =
> local. That is the simplest model.You DO not need to start
> $SPAK_HOME/sbin/start-master.sh and $SPAK_HOME/sbin/start-slaves.sh
>

Re: About a problem running a spark job in a cdh-5.7.0 vmware image.

2016-06-04 Thread Alonso Isidoro Roman
Hi David, but removing the setMaster line provokes this error:

org.apache.spark.SparkException: A master URL must be set in your
configuration
at org.apache.spark.SparkContext.(SparkContext.scala:402)
at
example.spark.AmazonKafkaConnector$.main(AmazonKafkaConnectorWithMongo.scala:93)
at
example.spark.AmazonKafkaConnector.main(AmazonKafkaConnectorWithMongo.scala)
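The usual way around this is to leave the master out of the code entirely and
supply it from the launcher, so the same jar runs locally or on YARN without
recompiling. A sketch (this only works when the jar is started through
spark-submit, not when it is run as a plain sbt-pack command):

import org.apache.spark.{SparkConf, SparkContext}

// No setMaster() here; the master comes from the launcher instead, e.g.:
//   spark-submit --master yarn --deploy-mode client \
//     --class example.spark.AmazonKafkaConnectorWithMongo <jar> <broker> <topic>
val sparkConf = new SparkConf().setAppName("AmazonKafkaConnector")
val sc = new SparkContext(sparkConf)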




Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-06-03 18:23 GMT+02:00 David Newberger <david.newber...@wandcorp.com>:

> Alonso, I could totally be misunderstanding something or missing a piece
> of the puzzle however remove .setMaster. If you do that it will run with
> whatever the CDH VM is setup for which in the out of the box default case
> is YARN and Client.
>
> val sparkConf = new SparkConf().setAppName(“Some App thingy thing”)
>
>
>
> From the Spark 1.6.0 Scala API Documentation:
>
>
> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.SparkConf
>
>
>
>
> “
> Configuration for a Spark application. Used to set various Spark
> parameters as key-value pairs.
>
> Most of the time, you would create a SparkConf object with new SparkConf(),
> which will load values from any spark.* Java system properties set in
> your application as well. In this case, parameters you set directly on the
>  SparkConf object take priority over system properties.
>
> For unit tests, you can also call new SparkConf(false) to skip loading
> external settings and get the same configuration no matter what the system
> properties are.
>
> All setter methods in this class support chaining. For example, you can
> write new SparkConf().setMaster("local").setAppName("My app").
>
> Note that once a SparkConf object is passed to Spark, it is cloned and can
> no longer be modified by the user. Spark does not support modifying the
> configuration at runtime.
>
> “
>
>
>
> *David Newberger*
>
>
>
> *From:* Alonso Isidoro Roman [mailto:alons...@gmail.com]
> *Sent:* Friday, June 3, 2016 10:37 AM
> *To:* David Newberger
> *Cc:* user@spark.apache.org
> *Subject:* Re: About a problem running a spark job in a cdh-5.7.0 vmware
> image.
>
>
>
> Thank you David, so, i would have to change the way that i am creating
>  SparkConf object, isn't?
>
>
>
> I can see in this link
> <http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_running_spark_on_yarn.html#concept_ysw_lnp_h5>
>  that
> the way to run a spark job using YARN is using this kind of command:
>
>
>
> spark-submit --class org.apache.spark.examples.SparkPi --master yarn \
>
> --deploy-mode client SPARK_HOME/lib/spark-examples.jar 10
>
> Can i use this way programmatically? maybe changing setMaster? to something 
> like setMaster("yarn:quickstart.cloudera:8032")?
>
> I have seen the port in this guide: 
> http://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_ports_cdh5.html
>
>
>
>
>
>


Re: About a problem running a spark job in a cdh-5.7.0 vmware image.

2016-06-03 Thread Alonso Isidoro Roman
Thank you David, so I would have to change the way I am creating the
SparkConf object, wouldn't I?

I can see in this link
<http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_running_spark_on_yarn.html#concept_ysw_lnp_h5>
that the way to run a spark job using YARN is with this kind of command:

spark-submit --class org.apache.spark.examples.SparkPi --master yarn \
--deploy-mode client SPARK_HOME/lib/spark-examples.jar 10

Can I do this programmatically, maybe by changing setMaster to something
like setMaster("yarn:quickstart.cloudera:8032")?

I have seen the port in this guide:
http://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_ports_cdh5.html
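For what it is worth, in Spark 1.x the programmatic equivalent seems to be a
master string of "yarn-client" rather than a host:port URL; the ResourceManager
address is picked up from the Hadoop configuration (a sketch, assuming
HADOOP_CONF_DIR / YARN_CONF_DIR point at the cluster config):

import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf()
  .setAppName("AmazonKafkaConnector")
  .setMaster("yarn-client") // not "yarn:quickstart.cloudera:8032"
val sc = new SparkContext(sparkConf)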


Re: About a problem when mapping a file located within a HDFS vmware cdh-5.7 image

2016-06-01 Thread Alonso Isidoro Roman
Thank you David, I will try to follow your advice.

Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-05-31 21:28 GMT+02:00 David Newberger <david.newber...@wandcorp.com>:

> Have you tried it without either of the setMaster lines?
>
>
> Also, CDH 5.7 uses spark 1.6.0 with some patches. I would recommend using
> the cloudera repo for spark files in build sbt. I’d also check other files
> in the build sbt to see if there are cdh specific versions.
>
>
>
> *David Newberger*
>
>
>
> *From:* Alonso Isidoro Roman [mailto:alons...@gmail.com]
> *Sent:* Tuesday, May 31, 2016 1:23 PM
> *To:* David Newberger
> *Cc:* user@spark.apache.org
> *Subject:* Re: About a problem when mapping a file located within a HDFS
> vmware cdh-5.7 image
>
>
>
> Hi David, the one of the develop branch. I think It should be the same,
> but actually not sure...
>
>
>
> Regards
>
>
> *Alonso Isidoro Roman*
>
> about.me/alonso.isidoro.roman
>
>
>
> 2016-05-31 19:40 GMT+02:00 David Newberger <david.newber...@wandcorp.com>:
>
> Is
> https://github.com/alonsoir/awesome-recommendation-engine/blob/master/build.sbt
>   the build.sbt you are using?
>
>
>
> *David Newberger*
>
> QA Analyst
>
> *WAND*  -  *The Future of Restaurant Technology*
>
> (W)  www.wandcorp.com
>
> (E)   david.newber...@wandcorp.com
>
> (P)   952.361.6200
>
>
>
> *From:* Alonso [mailto:alons...@gmail.com]
> *Sent:* Tuesday, May 31, 2016 11:11 AM
> *To:* user@spark.apache.org
> *Subject:* About a problem when mapping a file located within a HDFS
> vmware cdh-5.7 image
>
>
>
> I have a vmware cloudera image, cdh-5.7 running with centos6.8, i am using
> OS X as my development machine, and the cdh image to run the code, i upload
> the code using git to the cdh image, i have modified my /etc/hosts file
> located in the cdh image with a line like this:
>
> 127.0.0.1   quickstart.cloudera quickstart  localhost   
> localhost.domain
>
>
>
> 192.168.30.138   quickstart.cloudera quickstart  localhost   
> localhost.domain
>
> The cloudera version that i am running is:
>
> [cloudera@quickstart bin]$ cat /usr/lib/hadoop/cloudera/cdh_version.properties
>
>
>
> # Autogenerated build properties
>
> version=2.6.0-cdh5.7.0
>
> git.hash=c00978c67b0d3fe9f3b896b5030741bd40bf541a
>
> cloudera.hash=c00978c67b0d3fe9f3b896b5030741bd40bf541a
>
> cloudera.cdh.hash=e7465a27c5da4ceee397421b89e924e67bc3cbe1
>
> cloudera.cdh-packaging.hash=8f9a1632ebfb9da946f7d8a3a8cf86efcdccec76
>
> cloudera.base-branch=cdh5-base-2.6.0
>
> cloudera.build-branch=cdh5-2.6.0_5.7.0
>
> cloudera.pkg.version=2.6.0+cdh5.7.0+1280
>
> cloudera.pkg.release=1.cdh5.7.0.p0.92
>
> cloudera.cdh.release=cdh5.7.0
>
> cloudera.build.time=2016.03.23-18:30:29GMT
>
> I can do a ls command in the vmware machine:
>
> [cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/ratings.csv
>
> -rw-r--r-- 1 cloudera cloudera 16906296 2016-05-30 11:29 
> /user/cloudera/ratings.csv
>
> I can read its content:
>
> [cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/ratings.csv | wc -l
>
> 568454
>
> The code is quite simple, just trying to map its content:
>
> val ratingFile="hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv"
>
>
>
> case class AmazonRating(userId: String, productId: String, rating: Double)
>
>
>
> val NumRecommendations = 10
>
> val MinRecommendationsPerUser = 10
>
> val MaxRecommendationsPerUser = 20
>
> val MyUsername = "myself"
>
> val NumPartitions = 20
>
>
>
>
>
> println("Using this ratingFile: " + ratingFile)
>
>   // first create an RDD out of the rating file
>
> val rawTrainingRatings = sc.textFile(ratingFile).map {
>
> line =>
>
>   val Array(userId, productId, scoreStr) = line.split(",")
>
>   AmazonRating(userId, productId, scoreStr.toDouble)
>
> }
>
>
>
>   // only keep users that have rated between MinRecommendationsPerUser and 
> MaxRecommendationsPerUser products
>
> val trainingRatings = rawTrainingRatings.groupBy(_.userId).filter(r => 
> MinRecommendationsPerUser <= r._2.size  && r._2.size < 
> MaxRecommendationsPerUser).flatMap(_._2).repartition(NumPartitions).cache()
>
>
>
> println(s"Parsed $ratingFile. Kept ${trainingRatings.count()} ratings out of 
> ${rawTrainingRatings.count()}")
>
> I am getting this message:
>
&g

Re: About a problem when mapping a file located within a HDFS vmware cdh-5.7 image

2016-05-31 Thread Alonso Isidoro Roman
Hi David, it is the one on the develop branch. I think it should be the
same, but I'm actually not sure...

Regards

Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-05-31 19:40 GMT+02:00 David Newberger <david.newber...@wandcorp.com>:

> Is
> https://github.com/alonsoir/awesome-recommendation-engine/blob/master/build.sbt
>   the build.sbt you are using?
>
>
>
> *David Newberger*
>
> QA Analyst
>
> *WAND*  -  *The Future of Restaurant Technology*
>
> (W)  www.wandcorp.com
>
> (E)   david.newber...@wandcorp.com
>
> (P)   952.361.6200
>
>
>
> *From:* Alonso [mailto:alons...@gmail.com]
> *Sent:* Tuesday, May 31, 2016 11:11 AM
> *To:* user@spark.apache.org
> *Subject:* About a problem when mapping a file located within a HDFS
> vmware cdh-5.7 image
>
>
>
> I have a vmware cloudera image, cdh-5.7 running with centos6.8, i am using
> OS X as my development machine, and the cdh image to run the code, i upload
> the code using git to the cdh image, i have modified my /etc/hosts file
> located in the cdh image with a line like this:
>
> 127.0.0.1   quickstart.cloudera quickstart  localhost   
> localhost.domain
>
>
>
> 192.168.30.138   quickstart.cloudera quickstart  localhost   
> localhost.domain
>
> The cloudera version that i am running is:
>
> [cloudera@quickstart bin]$ cat /usr/lib/hadoop/cloudera/cdh_version.properties
>
>
>
> # Autogenerated build properties
>
> version=2.6.0-cdh5.7.0
>
> git.hash=c00978c67b0d3fe9f3b896b5030741bd40bf541a
>
> cloudera.hash=c00978c67b0d3fe9f3b896b5030741bd40bf541a
>
> cloudera.cdh.hash=e7465a27c5da4ceee397421b89e924e67bc3cbe1
>
> cloudera.cdh-packaging.hash=8f9a1632ebfb9da946f7d8a3a8cf86efcdccec76
>
> cloudera.base-branch=cdh5-base-2.6.0
>
> cloudera.build-branch=cdh5-2.6.0_5.7.0
>
> cloudera.pkg.version=2.6.0+cdh5.7.0+1280
>
> cloudera.pkg.release=1.cdh5.7.0.p0.92
>
> cloudera.cdh.release=cdh5.7.0
>
> cloudera.build.time=2016.03.23-18:30:29GMT
>
> I can do a ls command in the vmware machine:
>
> [cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/ratings.csv
>
> -rw-r--r-- 1 cloudera cloudera 16906296 2016-05-30 11:29 
> /user/cloudera/ratings.csv
>
> I can read its content:
>
> [cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/ratings.csv | wc -l
>
> 568454
>
> The code is quite simple, just trying to map its content:
>
> val ratingFile="hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv"
>
>
>
> case class AmazonRating(userId: String, productId: String, rating: Double)
>
>
>
> val NumRecommendations = 10
>
> val MinRecommendationsPerUser = 10
>
> val MaxRecommendationsPerUser = 20
>
> val MyUsername = "myself"
>
> val NumPartitions = 20
>
>
>
>
>
> println("Using this ratingFile: " + ratingFile)
>
>   // first create an RDD out of the rating file
>
> val rawTrainingRatings = sc.textFile(ratingFile).map {
>
> line =>
>
>   val Array(userId, productId, scoreStr) = line.split(",")
>
>   AmazonRating(userId, productId, scoreStr.toDouble)
>
> }
>
>
>
>   // only keep users that have rated between MinRecommendationsPerUser and 
> MaxRecommendationsPerUser products
>
> val trainingRatings = rawTrainingRatings.groupBy(_.userId).filter(r => 
> MinRecommendationsPerUser <= r._2.size  && r._2.size < 
> MaxRecommendationsPerUser).flatMap(_._2).repartition(NumPartitions).cache()
>
>
>
> println(s"Parsed $ratingFile. Kept ${trainingRatings.count()} ratings out of 
> ${rawTrainingRatings.count()}")
>
> I am getting this message:
>
> Parsed hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv. Kept 0 
> ratings out of 568454
>
> because if i run the exact code within the spark-shell, i got this message:
>
> Parsed hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv. Kept 73279 
> ratings out of 568454
>
> *Why is it working fine within the spark-shell but it is not running
> fine programmatically  in the vmware image?*
>
> I am running the code using sbt-pack plugin to generate unix commands and
> run them within the vmware image which has the spark pseudocluster,
>
> This is the code i use to instantiate the sparkconf:
>
> val sparkConf = new SparkConf().setAppName("AmazonKafkaConnector")
>
>
> .setMaster("local[4]").set("spark.driver.allowMultipleContexts", "true")
&g

Re: Spark + Kafka processing trouble

2016-05-31 Thread Alonso Isidoro Roman
Mich's idea is quite good; if I were you, I would follow it...
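Mich's suggestion (quoted below) amounts to something like this (a sketch;
sqlContext is an existing SQLContext, and the JDBC URL, credentials and table
name are placeholders):

import java.util.Properties

val props = new Properties()
props.put("user", "dbUser")
props.put("password", "dbPassword")

// Pull the reference data once, cache it for the lifetime of the streaming job,
// and join against the temp table inside each batch instead of hitting the
// database from every task.
val lookup = sqlContext.read.jdbc("jdbc:postgresql://dbhost:5432/mydb", "lookup_table", props)
lookup.cache()
lookup.registerTempTable("lookup_table")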

Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-05-31 6:37 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:

> how are you getting your data from the database. Are you using JDBC.
>
> Can you actually call the database first (assuming the same data, put it
> in temp table in Spark and cache it for the duration of windows length and
> use the data from the cached table?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 31 May 2016 at 04:19, Malcolm Lockyer <malcolm.lock...@hapara.com>
> wrote:
>
>> On Tue, May 31, 2016 at 3:14 PM, Darren Govoni <dar...@ontrenet.com>
>> wrote:
>> > Well that could be the problem. A SQL database is essential a big
>> > synchronizer. If you have a lot of spark tasks all bottlenecking on a
>> single
>> > database socket (is the database clustered or colocated with spark
>> workers?)
>> > then you will have blocked threads on the database server.
>>
>> Totally agree this could be a big killer to scaling up, we are
>> planning to migrate. But in the meantime we are seeing such big issues
>> with test data of only a few records (1, 2, 1024 etc.) produced to
>> Kafka. Currently the database is NOT busy (CPU, memory and IO usage
>> from the DB is tiny).
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Logistic Regression in Spark Streaming

2016-05-27 Thread Alonso Isidoro Roman
I do not have any experience using LR in Spark, but you can see that LR is
already implemented in MLlib:

http://spark.apache.org/docs/latest/mllib-linear-methods.html
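For the streaming case specifically, MLlib also ships
StreamingLogisticRegressionWithSGD, which keeps updating the model as new
batches arrive. A minimal sketch (the feature count, line format and socket
source are placeholders, not CTR-specific choices):

import org.apache.spark.SparkConf
import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("streaming-lr"), Seconds(10))

val numFeatures = 10
// Expected line format: "label,f1 f2 f3 ..." - purely illustrative.
val trainingData = ssc.socketTextStream("localhost", 9999).map { line =>
  val Array(label, features) = line.split(",")
  LabeledPoint(label.toDouble, Vectors.dense(features.split(" ").map(_.toDouble)))
}

val model = new StreamingLogisticRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

model.trainOn(trainingData)
model.predictOnValues(trainingData.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()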



Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-05-27 9:09 GMT+02:00 kundan kumar <iitr.kun...@gmail.com>:

> Hi ,
>
> Do we have a streaming version of Logistic Regression in Spark ? I can see
> its there for the Linear Regression.
>
> Has anyone used logistic regression on streaming data, it would be really
> helpful if you share your insights on how to train the incoming data.
>
> In my use case I am trying to use logistic regression for click through
> rate prediction using spark. Reason to go for online streaming mode is we
> have new advertisers and items coming and old items leaving.
>
> Any insights would be helpful.
>
>
> Regards,
> Kundan
>
>


Re: about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-26 Thread Alonso Isidoro Roman
Thank you Cody, I will try to follow your advice.
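Concretely, that advice applied to this code would look roughly like the
following (a sketch; messages and recommender are the ones from the quoted
code below, and parseRating stands in for however an AmazonRating is built
from the JSON payload):

messages.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // Parse the whole partition and hand it to the recommender in one call.
    val ratings = partition.map { case (_, json) => parseRating(json) }.toSeq
    if (ratings.nonEmpty) recommender.predictWithALS(ratings)
  }
}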

Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-05-26 17:00 GMT+02:00 Cody Koeninger <c...@koeninger.org>:

> Honestly given this thread, and the stack overflow thread, I'd say you
> need to back up, start very simply, and learn spark.  If for some reason
> the official docs aren't doing it for you, learning spark from oreilly is a
> good book.
>
> Given your specific question, why not just
>
> messages.foreachRDD { rdd =>
> rdd.foreachPartition { iterator =>
>   someWorkOnAnIterator(iterator)
>
>
> All of the other extraneous stuff you're doing doesn't make any sense to
> me.
>
>
>
> On Thu, May 26, 2016 at 2:48 AM, Alonso Isidoro Roman <alons...@gmail.com>
> wrote:
>
>> Hi Matthias and Cody,
>>
>> You can see in the code that StreamingContext.start() is called after the
>> messages.foreachRDD output action. Another problem, @Cody, is how I can
>> avoid the inner .foreachRDD(_.foreachPartition(it =>
>> recommender.predictWithALS(it.toSeq))) so that recommender.predictWithALS,
>> which runs a machine learning ALS implementation on a message from the
>> Kafka topic, is invoked asynchronously.
>>
>> In the current code I am not yet saving any data to the Mongo instance;
>> for now it is more important to focus on how to receive the message from
>> the Kafka topic and feed the ALS implementation asynchronously. The
>> Recommender object will probably need the code to interact with the Mongo
>> instance.
>>
>> The idea of the process is to receive data from the Kafka topic, calculate
>> recommendations based on the incoming message and save the results in a
>> Mongo instance. Is it possible?  Am I missing something important?
>>
>> def main(args: Array[String]) {
>> // Process program arguments and set properties
>>
>> if (args.length < 2) {
>>   System.err.println("Usage: " + this.getClass.getSimpleName + "
>>  <brokers> <topics>")
>>   System.exit(1)
>> }
>>
>> val Array(brokers, topics) = args
>>
>> println("Initializing Streaming Spark Context and kafka connector...")
>> // Create context with 2 second batch interval
>> val sparkConf = new SparkConf().setAppName("AmazonKafkaConnector")
>>.setMaster("local[4]")
>>
>> .set("spark.driver.allowMultipleContexts", "true")
>>
>> val sc = new SparkContext(sparkConf)
>>
>> sc.addJar("target/scala-2.10/blog-spark-recommendation_2.10-1.0-SNAPSHOT.jar")
>> val ssc = new StreamingContext(sparkConf, Seconds(2))
>> //this checkpointdir should be in a conf file, for now it is
>> hardcoded!
>> val streamingCheckpointDir =
>> "/Users/aironman/my-recommendation-spark-engine/checkpoint"
>> ssc.checkpoint(streamingCheckpointDir)
>>
>> // Create direct kafka stream with brokers and topics
>> val topicsSet = topics.split(",").toSet
>> val kafkaParams = Map[String, String]("metadata.broker.list" ->
>> brokers)
>> val messages = KafkaUtils.createDirectStream[String, String,
>> StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
>> println("Initialized Streaming Spark Context and kafka connector...")
>>
>> //create recomendation module
>> println("Creating rating recommender module...")
>> val ratingFile= "ratings.csv"
>> val recommender = new Recommender(sc,ratingFile)
>> println("Initialized rating recommender module...")
>>
>> //i have to convert messages which is a InputDStream into a
>> Seq[AmazonRating]...
>> try{
>> messages.foreachRDD( rdd =>{
>>   val count = rdd.count()
>>   if (count > 0){
>> //someMessages should be AmazonRating...
>> val someMessages = rdd.take(count.toInt)
>> println("<-->")
>> println("someMessages is " + someMessages)
>> someMessages.foreach(println)
>> println("<-->")
>> println("<---POSSIBLE SOLUTION--->")
>>
>> messages
>> .map { case (_, jsonRating) =>
>>   val jsValue = Json.parse(jsonRating)
>>   AmazonRating.

Re: about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-26 Thread Alonso Isidoro Roman
Hi Matthias and Cody,

You can see in the code that StreamingContext.start() is called after the
messages.foreachRDD output action. Another problem, @Cody, is how I can avoid
the inner .foreachRDD(_.foreachPartition(it =>
recommender.predictWithALS(it.toSeq))) so that recommender.predictWithALS,
which runs a machine learning ALS implementation on a message from the Kafka
topic, is invoked asynchronously.

In the current code I am not yet saving any data to the Mongo instance; for
now it is more important to focus on how to receive the message from the
Kafka topic and feed the ALS implementation asynchronously. The Recommender
object will probably need the code to interact with the Mongo instance.

The idea of the process is to receive data from the Kafka topic, calculate
recommendations based on the incoming message and save the results in a
Mongo instance. Is it possible?  Am I missing something important?

def main(args: Array[String]) {
// Process program arguments and set properties

if (args.length < 2) {
  System.err.println("Usage: " + this.getClass.getSimpleName + "
 <brokers> <topics>")
  System.exit(1)
}

val Array(brokers, topics) = args

println("Initializing Streaming Spark Context and kafka connector...")
// Create context with 2 second batch interval
val sparkConf = new SparkConf().setAppName("AmazonKafkaConnector")
   .setMaster("local[4]")

.set("spark.driver.allowMultipleContexts", "true")

val sc = new SparkContext(sparkConf)

sc.addJar("target/scala-2.10/blog-spark-recommendation_2.10-1.0-SNAPSHOT.jar")
val ssc = new StreamingContext(sparkConf, Seconds(2))
//this checkpointdir should be in a conf file, for now it is hardcoded!
val streamingCheckpointDir =
"/Users/aironman/my-recommendation-spark-engine/checkpoint"
ssc.checkpoint(streamingCheckpointDir)

// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String,
StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
println("Initialized Streaming Spark Context and kafka connector...")

//create recomendation module
println("Creating rating recommender module...")
val ratingFile= "ratings.csv"
val recommender = new Recommender(sc,ratingFile)
println("Initialized rating recommender module...")

//i have to convert messages which is a InputDStream into a
Seq[AmazonRating]...
try{
messages.foreachRDD( rdd =>{
  val count = rdd.count()
  if (count > 0){
//someMessages should be AmazonRating...
val someMessages = rdd.take(count.toInt)
println("<-->")
println("someMessages is " + someMessages)
someMessages.foreach(println)
println("<-->")
println("<---POSSIBLE SOLUTION--->")

messages
.map { case (_, jsonRating) =>
  val jsValue = Json.parse(jsonRating)
  AmazonRating.amazonRatingFormat.reads(jsValue) match {
case JsSuccess(rating, _) => rating
case JsError(_) => AmazonRating.empty
  }
 }
.filter(_ != AmazonRating.empty)
//probably is not a good idea to do this...
.foreachRDD(_.foreachPartition(it =>
recommender.predictWithALS(it.toSeq)))

println("<---POSSIBLE SOLUTION--->")

  }
  }
)
}catch{
  case e: IllegalArgumentException => {println("illegal arg.
exception")};
  case e: IllegalStateException=> {println("illegal state
exception")};
  case e: ClassCastException   => {println("ClassCastException")};
  case e: Exception=> {println(" Generic Exception")};
}finally{

  println("Finished taking data from kafka topic...")
}

//println("jsonParsed is " + jsonParsed)
//The idea is to save results from Recommender.predict within mongodb,
so i will have to deal with this issue
//after resolving the issue of
.foreachRDD(_.foreachPartition(recommender.predict(_.toSeq)))

*ssc.start()*
ssc.awaitTermination()

    println("Finished!")
  }
}

Thank you for reading until here, please, i need your assistance.

Regards


Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-05-25 17:33 GMT+02:00 Alonso Isidoro Roman <alons...@gmail.com>:

> Hi Matthias and Cody, thanks for the a

Re: about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-25 Thread Alonso Isidoro Roman
)

at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)

at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224)

at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)

at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)

at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)

at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223)

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)



In your opinion, what changes do I have to make in order to get this code up
and running correctly?

The idea is to take every rating message that I receive from the Kafka topic,
run the recommender.predictWithALS method on it and save the results in a
Mongo instance. I was thinking that this kind of task should be asynchronous,
shouldn't it? If I am right, how should I change the method to work that way?

Recommender.predictWithALS method:

def predictWithALS(ratings: Seq[AmazonRating]) = {
// train model
val myRatings = ratings.map(toSparkRating)
val myRatingRDD = sc.parallelize(myRatings)

val startAls = DateTime.now
val model = ALS.train((sparkRatings ++
myRatingRDD).repartition(NumPartitions), 10, 20, 0.01)

val myProducts = myRatings.map(_.product).toSet
val candidates = sc.parallelize((0 until
productDict.size).filterNot(myProducts.contains))

// get ratings of all products not in my history, ordered by rating
// (higher first), and only keep the first NumRecommendations
val myUserId = userDict.getIndex(MyUsername)
val recommendations = model.predict(candidates.map((myUserId,
_))).collect
    val endAls = DateTime.now
val result =
recommendations.sortBy(-_.rating).take(NumRecommendations).map(toAmazonRating)
val alsTime = Seconds.secondsBetween(startAls, endAls).getSeconds

println(s"ALS Time: $alsTime seconds")
result
  }

thank you very much

Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-05-25 16:54 GMT+02:00 Matthias Niehoff <matthias.nieh...@codecentric.de
>:

> Hi,
>
> you register some output actions (in this case foreachRDD) after starting
> the streaming context. StreamingContext.start() has to be called after all
> output actions!
>
> 2016-05-25 15:59 GMT+02:00 Alonso <alons...@gmail.com>:
>
>> Hi, i am receiving this exception when direct spark streaming process
>> tries to pull data from kafka topic:
>>
>> 16/05/25 11:30:30 INFO CheckpointWriter: Checkpoint for time
>> 146416863 ms saved to file
>> 'file:/Users/aironman/my-recommendation-spark-engine/checkpoint/checkpoint-146416863',
>> took 5928 bytes and 8 ms
>>
>> 16/05/25 11:30:30 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 
>> 1041 bytes result sent to driver
>> 16/05/25 11:30:30 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 
>> 2) in 4 ms on localhost (1/1)
>> 16/05/25 11:30:30 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks 
>> have all completed, from pool
>> 16/05/25 11:30:30 INFO DAGScheduler: ResultStage 2 (runJob at 
>> KafkaRDD.scala:98) finished in 0,004 s
>> 16/05/25 11:30:30 INFO DAGScheduler: Job 2 finished: runJob at 
>> KafkaRDD.scala:98, took 0,008740 s
>> <-->
>> someMessages is [Lscala.Tuple2;@2641d687
>> (null,{"userId":"someUserId","productId":"0981531679","rating":6.0})
>> <-->
>> <---POSSIBLE SOLUTION--->
>> 16/05/25 11:30:30 INFO JobScheduler: Finished job streaming job 
>> 146416863 ms.0 from job set of time 146416863 ms
>> 16/05/25 11:30:30 INFO KafkaRDD: Removing RDD 105 from persistence list
>> 16/05/25 11:30:30 INFO JobScheduler: Total delay: 0,020 s for time 
>> 146416863 ms (execution: 0,012 s)
>> 16/05/25 11:30:30 ERROR JobScheduler: Error running job streaming job 
>> 146416863 ms.0*java.lang.IllegalStateException: Adding new inputs, 
>> transformations, and output operations after starting a context is not 
>> supported
>>  at* 
>> org.apache.spark.streaming.dstream.DStream.validateAtInit(DStream.scala:222)
>>  at org.apache.spark.streaming.dstream.DStream.(DStream.scala:64)
>>  at 
>> org.apache.spark.streaming.dstream.MappedDStream.(MappedDStream.scala:25)
>>  at 
>> org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:558)
&g

about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-25 Thread Alonso Isidoro Roman
n("someMessages is " + someMessages)
someMessages.foreach(println)
println("<-->")
println("<---POSSIBLE SOLUTION--->")
messages
.map { case (_, jsonRating) =>
  val jsValue = Json.parse(jsonRating)
  AmazonRating.amazonRatingFormat.reads(jsValue) match {
case JsSuccess(rating, _) => rating
case JsError(_) => AmazonRating.empty
  }
 }
.filter(_ != AmazonRating.empty)
*//I think that this line provokes the runtime exception...*
*.foreachRDD(_.foreachPartition(it =>
recommender.predictWithALS(it.toSeq)))*

println("<---POSSIBLE SOLUTION--->")

  }
  }
)
}catch{
  case e: IllegalArgumentException => {println("illegal arg.
exception")};
  case e: IllegalStateException=> {println("illegal state
exception")};
  case e: ClassCastException   => {println("ClassCastException")};
  case e: Exception=> {println(" Generic Exception")};
}finally{

  println("Finished taking data from kafka topic...")
}

Recommender object:

*def predictWithALS(ratings: Seq[AmazonRating])* = {
// train model
val myRatings = ratings.map(toSparkRating)
val myRatingRDD = sc.parallelize(myRatings)

val startAls = DateTime.now
val model = ALS.train((sparkRatings ++
myRatingRDD).repartition(NumPartitions), 10, 20, 0.01)

val myProducts = myRatings.map(_.product).toSet
val candidates = sc.parallelize((0 until
productDict.size).filterNot(myProducts.contains))

// get ratings of all products not in my history, ordered by rating
// (higher first), and only keep the first NumRecommendations
val myUserId = userDict.getIndex(MyUsername)
val recommendations = model.predict(candidates.map((myUserId,
_))).collect
val endAls = DateTime.now
val result =
recommendations.sortBy(-_.rating).take(NumRecommendations).map(toAmazonRating)
val alsTime = Seconds.secondsBetween(startAls, endAls).getSeconds

println(s"ALS Time: $alsTime seconds")
result
  }
}

And this is the kafka producer that push the json data within the topic:

object AmazonProducerExample {
  def main(args: Array[String]): Unit = {

val productId = args(0).toString
val userId = args(1).toString
val rating = args(2).toDouble
val topicName = "amazonRatingsTopic"

val producer = Producer[String](topicName)

//0981531679 is Scala Puzzlers...
//AmazonProductAndRating
AmazonPageParser.parse(productId,userId,rating).onSuccess { case
amazonRating =>
  //Is this the correct way? the best performance? possibly not, what
about using avro or parquet?
  producer.send(Json.toJson(amazonRating).toString)
  //producer.send(amazonRating)
  println("amazon product with rating sent to kafka cluster..." +
amazonRating.toString)
  System.exit(0)
}

  }
}


I have written a Stack Overflow post
<http://stackoverflow.com/questions/37303202/about-an-error-accessing-a-field-inside-tuple2>
with more details. Please help, I am stuck with this issue and I don't know
how to continue.

Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>


Re: Spark JOIN Not working

2016-05-24 Thread Alonso Isidoro Roman
Could you share a bit of the dataset? It is difficult to test without data...
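
One common cause when plain SELECTs work but a JOIN returns nothing is that
the key columns read from the CSVs carry stray whitespace or differing types.
A small, hedged sketch of normalising the keys before joining (prodDF, hierDF
and the column names are made up):

import org.apache.spark.sql.functions.trim

val products  = prodDF.withColumn("prod_id", trim(prodDF("prod_id").cast("string")))
val hierarchy = hierDF.withColumn("prod_id", trim(hierDF("prod_id").cast("string")))

products.registerTempTable("products")
hierarchy.registerTempTable("hierarchy")

// the same query as a JOIN over the normalised keys
sqlContext.sql(
  """SELECT p.prod_id, h.level_name
    |FROM products p
    |JOIN hierarchy h ON p.prod_id = h.prod_id""".stripMargin).show()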

Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-05-24 8:43 GMT+02:00 Aakash Basu <raj2coo...@gmail.com>:

> Hi experts,
>
> I'm extremely new to the Spark ecosystem, hence I need help from you guys.
> I'm trying to fetch data from CSV files and query them with joins as
> needed. When I cache the data using registerTempTable and then use a
> SELECT query to pick what I need for the given condition, I get the data.
> BUT when I try to do the same query using a JOIN, I get zero results.
>
> Both attached codes are the same except for a few differences:
> Running_Code.scala uses the SELECT query and
> ProductHierarchy_Dimension.scala uses the JOIN queries.
>
> Please help me out with this. I have been stuck for two long days.
>
> Thanks,
> Aakash.
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>


Re: [Spark 1.5.2]Check Foreign Key constraint

2016-05-11 Thread Alonso Isidoro Roman
I think that Impala and Hive have this feature. Impala is faster than Hive;
Hive is probably better suited to batch mode.
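
If the goal is just to find child rows whose key has no match in the parent
table, one workaround inside Spark 1.5.2 itself is to rewrite the subquery as
an outer join and filter the nulls. A rough sketch with made-up table and
column names (childDF and parentDF are DataFrames loaded from Phoenix):

val orphans = childDF
  .join(parentDF, childDF("fk_id") === parentDF("id"), "left_outer")
  .filter(parentDF("id").isNull)      // rows that violate the foreign-key relationship
  .select(childDF("fk_id"))

orphans.show()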

Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-05-11 9:57 GMT+02:00 Divya Gehlot <divya.htco...@gmail.com>:

> Hi,
> I am using Spark 1.5.2  with Apache Phoenix 4.4
> Spark 1.5.2 doesn't support subqueries in WHERE conditions:
> https://issues.apache.org/jira/browse/SPARK-4226
>
> Is there any alternative way to check foreign key constraints?
> Would really appreciate the help.
>
>
>
> Thanks,
> Divya
>


Re: Multiple Spark Applications that use Cassandra, how to share resources/nodes

2016-05-04 Thread Alonso Isidoro Roman
Andy, I think there are some ideas about implementing a pool of Spark
contexts, but for now it is only an idea.


https://github.com/spark-jobserver/spark-jobserver/issues/365


It is possible to share a Spark context between apps, but I have not had to
use this feature, sorry about that.

Regards,

Alonso



Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"


2016-05-04 11:08 GMT+02:00 Tobias Eriksson <tobias.eriks...@qvantel.com>:

> Hi Andy,
>  We have a very simple approach I think, we do like this
>
>1. Submit our Spark application to the Spark Master (version 1.6.1.)
>2. Our Application creates a Spark Context that we use throughout
>3. We use Spray REST server
>4. Every request that comes in we simply serve by querying Cassandra
>doing some joins and some processing, and returns JSON as a result back on
>the REST-API
>5. We to take advantage of co-locating the Spark Workers with the
>Cassandra Nodes to “boost” performance (in our test lab we have a 4 node
>cluster)
>
> Performance wise we have had some challenges but that has had to do with
> how the data was arranged in Cassandra, after changing to the
> time-series-design-pattern we improved our performance dramatically, 750
> times in our test lab.
>
> But now the problem is that we have more Spark applications running
> concurrently/in parallell and we are then forced to scale down on the
> number of cores that OUR application can use to ensure that we give way for
> other applications to come in a “play” too. This is not optimal, cause if
> there is free resources then I would like to use them
>
> When it comes to load balancing the REST requests, in my case
> I will not have that many clients, yet I think that I could
> scale by adding multiple instances of my Spark Applications, but would
> obviously suffer in having to share the resources between the different
> Spark Workers (say cores). Or I would have to use dynamic resourcing.
> But as I started out my question here this is where I struggle, I need to
> get this right with sharing the resources.
> This is a challenge since I rely on the fact that I HAVE TO co-locate the
> Spark Workers and Cassandra Nodes, meaning that I cannot have 3 out of 4 nodes,
> cause then the Cassandra access will not be efficient since I use
> repartitionByCassandraReplica()
>
> Satisfying 250ms requests, well that depends very much on your use case I
> would say, and boring answer :-( sorry
>
> Regards
>  Tobias
>
> From: Andy Davidson <a...@santacruzintegration.com>
> Date: Tuesday 3 May 2016 at 17:26
> To: Tobias Eriksson <tobias.eriks...@qvantel.com>, "user@spark.apache.org"
> <user@spark.apache.org>
> Subject: Re: Multiple Spark Applications that use Cassandra, how to share
> resources/nodes
>
> Hi Tobias
>
> I am very interested implemented rest based api on top of spark. My rest
> based system would make predictions from data provided in the request using
> models trained in batch. My SLA is 250 ms.
>
> Would you mind sharing how you implemented your rest server?
>
> I am using spark-1.6.1. I have several unit tests that create spark
> context, master is set to ‘local[4]’. I do not think the unit test frame is
> going to scale. Can each rest server have a pool of sparks contexts?
>
>
> The system would like to replacing is set up as follows
>
> Layer of dumb load balancers: l1, l2, l3
> Layer of proxy servers:   p1, p2, p3, p4, p5, ….. Pn
> Layer of containers:  c1, c2, c3, ….. Cn
>
> Where Cn is much larger than Pn
>
>
> Kind regards
>
> Andy
>
> P.s. There is a talk on 5/5 about spark 2.0 Hoping there is something in
> the near future.
>
> https://www.brighttalk.com/webcast/12891/202021?utm_campaign=google-calendar_content=_source=brighttalk-portal_medium=calendar_term=
>
> From: Tobias Eriksson <tobias.eriks...@qvantel.com>
> Date: Tuesday, May 3, 2016 at 7:34 AM
> To: "user @spark" <user@spark.apache.org>
> Subject: Multiple Spark Applications that use Cassandra, how to share
> resources/nodes
>
> Hi
>  We are using Spark for a long running job, in fact it is a REST-server
> that does some joins with some tables in Casandra and returns the result.
> Now we need to have multiple applications running in the same Spark
> cluster, and from what I understand t

Re: migration from Teradata to Spark SQL

2016-05-04 Thread Alonso Isidoro Roman
I agree with Deepak, and I would try to save the data in both Parquet and
Avro format if you can. Try to measure the performance and choose the best;
it will probably be Parquet, but you have to verify that for yourself.
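
A rough sketch of how the comparison could be driven from Spark itself; the
Avro writer assumes the spark-avro package is on the classpath, df is a
hypothetical DataFrame imported from Teradata (e.g. via Sqoop into Hive), and
the paths are placeholders:

def timed[T](label: String)(body: => T): T = {
  val t0 = System.nanoTime()
  val result = body
  println(s"$label took ${(System.nanoTime() - t0) / 1e9} s")
  result
}

timed("parquet write") { df.write.parquet("/tmp/bench/parquet") }
timed("avro write")    { df.write.format("com.databricks.spark.avro").save("/tmp/bench/avro") }

// measure the read/query side the same way before settling on a format
val parquetDF = timed("parquet read") { sqlContext.read.parquet("/tmp/bench/parquet") }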

Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"


2016-05-04 9:22 GMT+02:00 Jörn Franke <jornfra...@gmail.com>:

> Look at lambda architecture.
>
> What is the motivation of your migration?
>
> On 04 May 2016, at 03:29, Tapan Upadhyay <tap...@gmail.com> wrote:
>
> Hi,
>
> We are planning to move our ad hoc queries from Teradata to Spark. We have
> a huge volume of queries during the day. What is the best way to go about it?
>
> 1) Read data directly from teradata db using spark jdbc
>
> 2) Import data using sqoop by EOD jobs into hive tables stored as parquet
> and then run queries on hive tables using spark sql or spark hive context.
>
> Are there any other ways we could do this better or more efficiently?
>
> Please guide.
>
> Regards,
> Tapan
>
>


Re: Code optimization

2016-04-19 Thread Alonso Isidoro Roman
Hi Angel,

how about using this:

k.filter(k("WT_ID")

as a val variable? I think you can avoid repeating it, and do not forget to
use System.nanoTime to measure the benefit...
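
For instance, the repeated TableN blocks below could be collapsed into a
single pass over the ten candidate IDs. This is only a sketch, untested
against your data:

import org.apache.spark.sql.functions.max

val t0 = System.nanoTime()

// one max-count per WT_ID instead of the hand-written Table0..Table11 blocks
val maxPerId = temp.map { id =>
  val row = k.filter(k("WT_ID") === id)
             .groupBy("Couple_time").count()
             .agg(max("count")).first()
  (id, row.getLong(0))
}

val best = maxPerId.maxBy(_._2)   // the WT_ID whose histogram has the tallest bin
println(s"best candidate: $best, took ${(System.nanoTime() - t0) / 1e9} s")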

Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"


2016-04-19 9:46 GMT+02:00 Angel Angel <areyouange...@gmail.com>:

> Hello,
>
> I am writing a Spark application; it runs well but takes a long time to
> execute. Can anyone help me optimize my query to increase the
> processing speed?
>
>
> I am writing one application in which i have to construct the histogram
> and compare the histograms in order to find the final candidate.
>
>
> In my code I read the text file, match the first field, subtract the
> second field from the matched candidates and update the table.
>
> Here is my code, Please help me to optimize it.
>
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>
>
> import sqlContext.implicits._
>
>
> val Array_Ele =
> sc.textFile("/root/Desktop/database_200/patch_time_All_20_modified_1.txt").flatMap(line=>line.split("
> ")).take(900)
>
>
> val df1=
> sqlContext.read.parquet("hdfs://hadoopm0:8020/tmp/input1/database_modified_No_name_400.parquet")
>
>
> var k = df1.filter(df1("Address").equalTo(Array_Ele(0) ))
>
> var a= 0
>
>
> for( a <-2 until 900 by 2){
>
> k=k.unionAll(
> df1.filter(df1("Address").equalTo(Array_Ele(a))).select(df1("Address"),df1("Couple_time")-Array_Ele(a+1),df1("WT_ID")))}
>
>
> k.cache()
>
>
> val WT_ID_Sort  = k.groupBy("WT_ID").count().sort(desc("count"))
>
>
> val temp = WT_ID_Sort.select("WT_ID").rdd.map(r=>r(0)).take(10)
>
>
> val Table0=
> k.filter(k("WT_ID").equalTo(temp(0))).groupBy("Couple_time").count().select(max($"count")).show()
>
> val Table1=
> k.filter(k("WT_ID").equalTo(temp(1))).groupBy("Couple_time").count().select(max($"count")).show()
>
> val Table2=
> k.filter(k("WT_ID").equalTo(temp(2))).groupBy("Couple_time").count().select(max($"count")).show()
>
> val Table3=
> k.filter(k("WT_ID").equalTo(temp(3))).groupBy("Couple_time").count().select(max($"count")).show()
>
> val Table4=
> k.filter(k("WT_ID").equalTo(temp(4))).groupBy("Couple_time").count().select(max($"count")).show()
>
> val Table5=
> k.filter(k("WT_ID").equalTo(temp(5))).groupBy("Couple_time").count().select(max($"count")).show()
>
> val Table6=
> k.filter(k("WT_ID").equalTo(temp(6))).groupBy("Couple_time").count().select(max($"count")).show()
>
> val Table7=
> k.filter(k("WT_ID").equalTo(temp(7))).groupBy("Couple_time").count().select(max($"count")).show()
>
> val Table8=
> k.filter(k("WT_ID").equalTo(temp(8))).groupBy("Couple_time").count().select(max($"count")).show()
>
>
>
> val Table10=
> k.filter(k("WT_ID").equalTo(temp(10))).groupBy("Couple_time").count().select(max($"count")).show()
>
>
> val Table11=
> k.filter(k("WT_ID").equalTo(temp(11))).groupBy("Couple_time").count().select(max($"count")).show()
>
>
> And the last thing: how can I compare all these tables to find the maximum
> value?
>
>
>
>
> Thanks,
>
>
>


Re: Run a self-contained Spark app on a Spark standalone cluster

2016-04-12 Thread Alonso Isidoro Roman
I don't know how to do it with Python, but Scala has an sbt plugin named
sbt-pack that creates a self-contained Unix command from your code, with no
need to use spark-submit. There should be something similar to this tool out
there for Python.
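
That said, the error quoted below (“Must specify a primary resource”)
usually just means spark-submit was not given a main .py file as the last
argument. A hedged guess at the intended invocation, with an illustrative
host and entry-point file, and assuming the standalone master listens on the
default 7077 port rather than the web UI port:

./bin/spark-submit \
  --master spark://<master-host>:7077 \
  --py-files weather_predict.zip \
  main.py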



Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"


2016-04-12 11:06 GMT+02:00 kevllino <kevin.e...@mail.dcu.ie>:

> Hi,
>
> I need to know how to run a self-contained Spark app  (3 python files) in a
> Spark standalone cluster. Can I move the .py files to the cluster, or
> should
> I store them locally, on HDFS or S3? I tried the following locally and on
> S3
> with a zip of my .py files as suggested  here
> <http://spark.apache.org/docs/latest/submitting-applications.html>  :
>
> ./bin/spark-submit --master
> spark://ec2-54-51-23-172.eu-west-1.compute.amazonaws.com:5080
> --py-files
> s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@mubucket
> //weather_predict.zip
>
> But I get: “Error: Must specify a primary resource (JAR or Python file)”
>
> Best,
> Kevin
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Run-a-self-contained-Spark-app-on-a-Spark-standalone-cluster-tp26753.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Problem with jackson lib running on spark

2016-03-31 Thread Alonso Isidoro Roman
As you can see, jackson-core is provided by several libraries; try to
exclude it from spark-core. I think the older minor version is pulled in
through it.

Use this guide to see how to do it:

https://maven.apache.org/guides/introduction/introduction-to-optional-and-excludes-dependencies.html



Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"


2016-03-31 20:01 GMT+02:00 Marcelo Oikawa <marcelo.oik...@webradar.com>:

> Hey, Alonso.
>
> here is the output:
>
> [INFO] spark-processor:spark-processor-druid:jar:1.0-SNAPSHOT
> [INFO] +- org.apache.spark:spark-streaming_2.10:jar:1.6.1:provided
> [INFO] |  +- org.apache.spark:spark-core_2.10:jar:1.6.1:provided
> [INFO] |  |  +- org.apache.avro:avro-mapred:jar:hadoop2:1.7.7:provided
> [INFO] |  |  |  +- org.apache.avro:avro-ipc:jar:1.7.7:provided
> [INFO] |  |  |  \- org.apache.avro:avro-ipc:jar:tests:1.7.7:provided
> [INFO] |  |  +- com.twitter:chill_2.10:jar:0.5.0:provided
> [INFO] |  |  |  \- com.esotericsoftware.kryo:kryo:jar:2.21:provided
> [INFO] |  |  | +-
> com.esotericsoftware.reflectasm:reflectasm:jar:shaded:1.07:provided
> [INFO] |  |  | +- com.esotericsoftware.minlog:minlog:jar:1.2:provided
> [INFO] |  |  | \- org.objenesis:objenesis:jar:1.2:provided
> [INFO] |  |  +- com.twitter:chill-java:jar:0.5.0:provided
> [INFO] |  |  +- org.apache.xbean:xbean-asm5-shaded:jar:4.4:provided
> [INFO] |  |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:provided
> [INFO] |  |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:provided
> [INFO] |  |  |  |  +- org.apache.commons:commons-math:jar:2.1:provided
> [INFO] |  |  |  |  +- xmlenc:xmlenc:jar:0.52:provided
> [INFO] |  |  |  |  +-
> commons-configuration:commons-configuration:jar:1.6:provided
> [INFO] |  |  |  |  |  +- commons-digester:commons-digester:jar:1.8:provided
> [INFO] |  |  |  |  |  |  \-
> commons-beanutils:commons-beanutils:jar:1.7.0:provided
> [INFO] |  |  |  |  |  \-
> commons-beanutils:commons-beanutils-core:jar:1.8.0:provided
> [INFO] |  |  |  |  \- org.apache.hadoop:hadoop-auth:jar:2.2.0:provided
> [INFO] |  |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:provided
> [INFO] |  |  |  |  \- org.mortbay.jetty:jetty-util:jar:6.1.26:provided
> [INFO] |  |  |  +-
> org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:provided
> [INFO] |  |  |  |  +-
> org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:provided
> [INFO] |  |  |  |  |  +-
> org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:provided
> [INFO] |  |  |  |  |  |  +-
> com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:provided
> [INFO] |  |  |  |  |  |  |  +-
> com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:provided
> [INFO] |  |  |  |  |  |  |  |  \-
> com.sun.jersey:jersey-client:jar:1.9:provided
> [INFO] |  |  |  |  |  |  |  \-
> com.sun.jersey:jersey-grizzly2:jar:1.9:provided
> [INFO] |  |  |  |  |  |  | +-
> org.glassfish.grizzly:grizzly-http:jar:2.1.2:provided
> [INFO] |  |  |  |  |  |  | |  \-
> org.glassfish.grizzly:grizzly-framework:jar:2.1.2:provided
> [INFO] |  |  |  |  |  |  | | \-
> org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:provided
> [INFO] |  |  |  |  |  |  | |\-
> org.glassfish.external:management-api:jar:3.0.0-b012:provided
> [INFO] |  |  |  |  |  |  | +-
> org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:provided
> [INFO] |  |  |  |  |  |  | |  \-
> org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:provided
> [INFO] |  |  |  |  |  |  | +-
> org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:provided
> [INFO] |  |  |  |  |  |  | \-
> org.glassfish:javax.servlet:jar:3.1:provided
> [INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-json:jar:1.9:provided
> [INFO] |  |  |  |  |  | +-
> org.codehaus.jettison:jettison:jar:1.1:provided
> [INFO] |  |  |  |  |  | |  \- stax:stax-api:jar:1.0.1:provided
> [INFO] |  |  |  |  |  | +-
> com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:provided
> [INFO] |  |  |  |  |  | |  \-
> javax.xml.bind:jaxb-api:jar:2.2.2:provided
> [INFO] |  |  |  |  |  | | \-
> javax.activation:activation:jar:1.1:provided
> [INFO] |  |  |  |  |  | +-
> org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:provided
> [INFO] |  |  |  |  |  | \-
> org.codehaus.jackson:jackson-xc:jar:1.8.3:provided
> [INFO] |  |  |  |  |  \-
> org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:prov

Re: Problem with jackson lib running on spark

2016-03-31 Thread Alonso Isidoro Roman
Run mvn dependency:tree and paste the output here; I suspect that the Jackson
library is included by more than one dependency.

Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"


2016-03-31 19:21 GMT+02:00 Marcelo Oikawa <marcelo.oik...@webradar.com>:

> Hey, Ted
>
> 2.4.4
>>
>> Looks like Tranquility uses different version of jackson.
>>
>> How do you build your jar ?
>>
>
> I'm building a jar with dependencies using the maven assembly plugin.
> Below is all jackson's dependencies:
>
> [INFO]
> com.fasterxml.jackson.module:jackson-module-scala_2.10:jar:2.4.5:compile
> [INFO]com.fasterxml.jackson.core:jackson-databind:jar:2.4.6:compile
> [INFO]
> com.fasterxml.jackson.datatype:jackson-datatype-joda:jar:2.4.6:compile
> [INFO]
> com.fasterxml.jackson.dataformat:jackson-dataformat-smile:jar:2.4.6:compile
> [INFO]
> com.fasterxml.jackson.jaxrs:jackson-jaxrs-smile-provider:jar:2.4.6:compile
> [INFO]com.fasterxml.jackson.jaxrs:jackson-jaxrs-base:jar:2.4.6:compile
> [INFO]
> com.fasterxml.jackson.module:jackson-module-jaxb-annotations:jar:2.4.6:compile
> [INFO]com.fasterxml.jackson.core:jackson-core:jar:2.4.6:compile
> [INFO]
> com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider:jar:2.4.6:compile
> [INFO]
> com.fasterxml.jackson.datatype:jackson-datatype-guava:jar:2.4.6:compile
> [INFO]com.fasterxml.jackson.core:jackson-annotations:jar:2.4.6:compile
> [INFO]org.json4s:json4s-jackson_2.10:jar:3.2.10:provided
> [INFO]org.codehaus.jackson:jackson-xc:jar:1.8.3:provided
> [INFO]org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:provided
> [INFO]org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13:compile
> [INFO]org.codehaus.jackson:jackson-core-asl:jar:1.9.13:compile
> [INFO]
> com.google.http-client:google-http-client-jackson2:jar:1.15.0-rc:compile
>
> As you can see, my jar requires fasterxml 2.4.6. In that case, what does
> Spark do? Does it run my jar with my Jackson lib (inside my jar) or use
> the Jackson version (2.4.4) used by Spark?
>
> Note that one of my dependencies is:
>
> 
> org.apache.spark
> spark-streaming_2.10
> 1.6.1
> provided
> 
>
> and the jackson version 2.4.4 was not listed in maven dependencies...
>
>
>>
>> Consider using maven-shade-plugin to resolve the conflict if you use
>> maven.
>>
>> Cheers
>>
>> On Thu, Mar 31, 2016 at 9:50 AM, Marcelo Oikawa <
>> marcelo.oik...@webradar.com> wrote:
>>
>>> Hi, list.
>>>
>>> We are working on a spark application that sends messages to Druid. For
>>> that, we're using Tranquility core. In my local test, I'm using the
>>> "spark-1.6.1-bin-hadoop2.6" distribution and the following dependencies in
>>> my app:
>>>
>>> 
>>> org.apache.spark
>>> spark-streaming_2.10
>>> 1.6.1
>>> provided
>>> 
>>> 
>>> io.druid
>>> tranquility-core_2.10
>>> 0.7.4
>>> 
>>>
>>> But i getting the error down below when Tranquility tries to create
>>> Tranquilizer object:
>>>
>>> tranquilizer = 
>>> DruidBeams.fromConfig(dataSourceConfig).buildTranquilizer(tranquilizerBuider);
>>>
>>> The stacktrace is down below:
>>>
>>> java.lang.IllegalAccessError: tried to access method
>>> com.fasterxml.jackson.databind.introspect.AnnotatedMember.getAllAnnotations()Lcom/fasterxml/jackson/databind/introspect/AnnotationMap;
>>> from class
>>> com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector
>>> at
>>> com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector.findInjectableValueId(GuiceAnnotationIntrospector.java:39)
>>> at
>>> com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findInjectableValueId(AnnotationIntrospectorPair.java:269)
>>> at
>>> com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._addDeserializerConstructors(BasicDeserializerFactory.java:433)
>>> at
>>> com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._constructDefaultValueInstantiator(BasicDeserializerFactory.java:325)
>>> at
>>> com.fasterxml.jackson.databind.deser.B

What version of twitter4j should I use with Spark Streaming?UPDATING thread

2016-03-01 Thread Alonso Isidoro Roman
Hi, I read this post recommending twitter4j-3.0.3* and it is not working for
me. I am trying to load these jars within spark-shell without any luck. This
is the output:


   1. *MacBook-Pro-Retina-de-Alonso:spark-1.6 aironman$ ls *jar*
   2. mysql-connector-java-5.1.30.jar twitter4j-async-3.0.3.jar
   twitter4j-examples-3.0.3.jartwitter4j-stream-3.0.3.jar
   3. spark-streaming-twitter_2.10-1.0.0.jar  twitter4j-core-3.0.3.jar
 twitter4j-media-support-3.0.3.jar
   4. *MacBook-Pro-Retina-de-Alonso:spark-1.6 aironman$* bin/spark-shell
   --jars spark-streaming-twitter_2.10-1.0.0.jar twitter4j-async-3.0.3.jar
   twitter4j-core-3.0.3.jar twitter4j-examples-3.0.3.jar
   twitter4j-media-support-3.0.3.jar twitter4j-stream-3.0.3.jar
   5. log4j:WARN No appenders could be found for logger
   (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
   6. log4j:WARN Please initialize the log4j system properly.
   7. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig
   for more info.
   8. 2016-03-01 18:00:33.542 java[35711:11354206] Unable to load realm
   info from SCDynamicStore
   9. Using Spark's repl log4j profile:
   org/apache/spark/log4j-defaults-repl.properties
   10. To adjust logging level use sc.setLogLevel("INFO")
   11. error: error while loading , error in opening zip file
   12.
   13. Failed to initialize compiler: object scala.runtime in compiler
   mirror not found.
   14. ** Note that as of 2.8 scala does not assume use of the java
   classpath.
   15. ** For the old behavior pass -usejavacp to scala, or if using a
   Settings
   16. ** object programatically, settings.usejavacp.value = true.
   17. 16/03/01 18:00:34 WARN SparkILoop$SparkILoopInterpreter: Warning:
   compiler accessed before init set up.  Assuming no postInit code.
   18.
   19. Failed to initialize compiler: object scala.runtime in compiler
   mirror not found.
   20. ** Note that as of 2.8 scala does not assume use of the java
   classpath.
   21. ** For the old behavior pass -usejavacp to scala, or if using a
   Settings
   22. ** object programatically, settings.usejavacp.value = true.
   23. Exception in thread "main" java.lang.AssertionError: assertion
   failed: null
   24. at scala.Predef$.assert(Predef.scala:179)
   25. at
   org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.scala:247)
   26. at
   
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:990)
   27. at
   
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
   28. at
   
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
   29. at
   
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   30. at org.apache.spark.repl.SparkILoop.org
   $apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
   31. at
   org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
   32. at org.apache.spark.repl.Main$.main(Main.scala:31)
   33. at org.apache.spark.repl.Main.main(Main.scala)
   34. at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   35. at
   sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   36. at
   
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   37. at java.lang.reflect.Method.invoke(Method.java:606)
   38. at
   
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
   39. at
   org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
   40. at
   org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
   41. at
   org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
   42. at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   43. MacBook-Pro-Retina-de-Alonso:spark-1.6 aironman$




Do I have to update the jar versions? Because I see that the current version
of twitter4j is 4.0.4...
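
One thing I should double-check myself: --jars expects a single
comma-separated list, so in the transcript above everything after the first
jar is being treated as the application resource. Something like this (same
jars, just comma-joined) is probably what was intended:

bin/spark-shell --jars spark-streaming-twitter_2.10-1.0.0.jar,twitter4j-core-3.0.3.jar,twitter4j-stream-3.0.3.jar,twitter4j-async-3.0.3.jar,twitter4j-media-support-3.0.3.jar,twitter4j-examples-3.0.3.jar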

thanks

Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"


Re: Spark Job Hanging on Join

2016-02-23 Thread Alonso Isidoro Roman
Thanks for sharing the know-how, guys.
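
For anyone else who hits this pattern: Spark SQL also has a null-safe
equality operator (<=>) that expresses the same intent more compactly,
although depending on the version it may still not be planned as a plain
equi-join, so it is worth checking .explain() either way. A small sketch with
made-up DataFrames a and b that have nullable key columns col1/col2:

val joined = a.join(
  b,
  (a("col1") <=> b("col1")) && (a("col2") <=> b("col2")),
  "left_outer")

joined.explain()   // confirm whether the planner still falls back to a nested-loop join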

Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"


2016-02-23 9:43 GMT+01:00 Mohannad Ali <man...@gmail.com>:

> Hello Everyone,
>
> Thanks a lot for the help. We also managed to solve it but without
> resorting to spark 1.6.
>
> The problem we were having was because of a really bad join condition:
>
> ON ((a.col1 = b.col1) or (a.col1 is null and b.col1 is null)) AND ((a.col2
> = b.col2) or (a.col2 is null and b.col2 is null))
>
> So what we did was re-work our logic to remove the null checks in the join
> condition and the join went lightning fast afterwards :)
> On Feb 22, 2016 21:24, "Dave Moyers" <davemoy...@icloud.com> wrote:
>
>> Good article! Thanks for sharing!
>>
>>
>> > On Feb 22, 2016, at 11:10 AM, Davies Liu <dav...@databricks.com> wrote:
>> >
>> > This link may help:
>> >
>> https://forums.databricks.com/questions/6747/how-do-i-get-a-cartesian-product-of-a-huge-dataset.html
>> >
>> > Spark 1.6 improved the CartesianProduct; you should turn off auto
>> > broadcast and go with CartesianProduct in 1.6
>> >
>> > On Mon, Feb 22, 2016 at 1:45 AM, Mohannad Ali <man...@gmail.com> wrote:
>> >> Hello everyone,
>> >>
>> >> I'm working with Tamara and I wanted to give you guys an update on the
>> >> issue:
>> >>
>> >> 1. Here is the output of .explain():
>> >>>
>> >>> Project
>> >>>
>> [sk_customer#0L,customer_id#1L,country#2,email#3,birthdate#4,gender#5,fk_created_at_date#6,age_range#7,first_name#8,last_name#9,inserted_at#10L,updated_at#11L,customer_id#25L
>> >>> AS new_customer_id#38L,country#24 AS new_country#39,email#26 AS
>> >>> new_email#40,birthdate#29 AS new_birthdate#41,gender#31 AS
>> >>> new_gender#42,fk_created_at_date#32 AS
>> >>> new_fk_created_at_date#43,age_range#30 AS
>> new_age_range#44,first_name#27 AS
>> >>> new_first_name#45,last_name#28 AS new_last_name#46]
>> >>> BroadcastNestedLoopJoin BuildLeft, LeftOuter, Somecustomer_id#1L =
>> >>> customer_id#25L) || (isnull(customer_id#1L) &&
>> isnull(customer_id#25L))) &&
>> >>> ((country#2 = country#24) || (isnull(country#2) &&
>> isnull(country#24)
>> >>>  Scan
>> >>>
>> PhysicalRDD[country#24,customer_id#25L,email#26,first_name#27,last_name#28,birthdate#29,age_range#30,gender#31,fk_created_at_date#32]
>> >>>  Scan
>> >>>
>> ParquetRelation[hdfs:///databases/dimensions/customer_dimension][sk_customer#0L,customer_id#1L,country#2,email#3,birthdate#4,gender#5,fk_created_at_date#6,age_range#7,first_name#8,last_name#9,inserted_at#10L,updated_at#11L]
>> >>
>> >>
>> >> 2. Setting spark.sql.autoBroadcastJoinThreshold=-1 didn't make a
>> difference.
>> >> It still hangs indefinitely.
>> >> 3. We are using Spark 1.5.2
>> >> 4. We tried running this with 4 executors, 9 executors, and even in
>> local
>> >> mode with master set to "local[4]". The issue still persists in all
>> cases.
>> >> 5. Even without trying to cache any of the dataframes this issue still
>> >> happens,.
>> >> 6. We have about 200 partitions.
>> >>
>> >> Any help would be appreciated!
>> >>
>> >> Best Regards,
>> >> Mo
>> >>
>> >> On Sun, Feb 21, 2016 at 8:39 PM, Gourav Sengupta <
>> gourav.sengu...@gmail.com>
>> >> wrote:
>> >>>
>> >>> Sorry,
>> >>>
>> >>> please include the following questions to the list above:
>> >>>
>> >>> the SPARK version?
>> >>> whether you are using RDD or DataFrames?
>> >>> is the code run locally or in SPARK Cluster mode or in AWS EMR?
>> >>>
>> >>>
>> >>> Regards,
>> >>> Gourav Sengupta
>> >>>
>> >>> On Sun, Feb 21, 2016 at 7:37 PM, Gourav Sengupta
>> >>> <gourav.sengu...@gmail.com> wrote:
>> >>>>
>> >>

Re: Stored proc with spark

2016-02-16 Thread Alonso Isidoro Roman
Relational databases? What about Sqoop?

https://en.wikipedia.org/wiki/Sqoop
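
Sqoop is the usual batch route; if a plain JDBC pull is enough, Spark can
also read straight from Oracle, although a stored procedure's result
generally has to be exposed as a table, view or query first. A hedged sketch
(the driver class is Oracle's standard one, but the URL, query and
credentials are illustrative):

val oracleDF = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")   // hypothetical connection string
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("dbtable", "(SELECT * FROM some_view) t")        // a query or view standing in for the proc's result
  .option("user", "scott")
  .option("password", "tiger")
  .load()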



Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"


2016-02-16 10:04 GMT+01:00 Gaurav Agarwal <gaurav130...@gmail.com>:

> Hi
> Can I load data into Spark from an Oracle stored proc?
>
> Thanks
>


Re: Spark job does not perform well when some RDD in memory and some on Disk

2016-02-04 Thread Alonso Isidoro Roman
"But learned that it is better not to reduce it to 0."

Could you explain this sentence a bit more?

thanks
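
(For context while waiting for the explanation: the knob being discussed is
spark.locality.wait, and it can be tuned per application instead of being
zeroed out. A minimal sketch, values purely illustrative:)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("locality-wait-demo")
  .set("spark.locality.wait", "1s")   // default is 3s; 0 disables the wait entirely
  // finer-grained overrides also exist: spark.locality.wait.process / .node / .rack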

Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"


2016-02-04 11:33 GMT+01:00 Prabhu Joseph <prabhujose.ga...@gmail.com>:

> Okay, the reason for the task delay within the executor when some RDDs are
> in memory and some in Hadoop (i.e., multiple locality levels, NODE_LOCAL
> and ANY) is that the scheduler waits for *spark.locality.wait* (3 seconds
> by default). During this period, the scheduler waits to launch a data-local
> task before giving up and launching it on a less-local node.
>
> So after making it 0, all tasks started in parallel. But I learned that it
> is better not to reduce it to 0.
>
>
> On Mon, Feb 1, 2016 at 2:02 PM, Prabhu Joseph <prabhujose.ga...@gmail.com>
> wrote:
>
>> Hi All,
>>
>>
>> Sample Spark application which reads a logfile from hadoop (1.2GB - 5
>> RDD's created each approx 250MB data) and there are two jobs. Job A gets
>> the line with "a" and the Job B gets the line with "b". The spark
>> application is ran multiple times, each time with
>> different executor memory, and enable/disable cache() function. Job A
>> performance is same in all the runs as it has to read the entire data first
>> time from Disk.
>>
>> Spark Cluster - standalone mode with Spark Master, single worker node (12
>> cores, 16GB memory)
>>
>> val logData = sc.textFile(logFile, 2)
>> var numAs = logData.filter(line => line.contains("a")).count()
>> var numBs = logData.filter(line => line.contains("b")).count()
>>
>>
>> *Job B (which has 5 tasks) results below:*
>>
>> *Run 1:* 1 executor with 2GB memory, 12 cores took 2 seconds [ran1 image]
>>
>> Since logData is not cached, the job B has to again read the 1.2GB
>> data from hadoop into memory and all the 5 tasks started parallel and each
>> took 2 sec (29ms for GC) and the
>>  overall job completed in 2 seconds.
>>
>> *Run 2:* 1 executor with 2GB memory, 12 cores and logData is cached took
>> 4 seconds [ran2 image, ran2_cache image]
>>
>>  val logData = sc.textFile(logFile, 2).cache()
>>
>>  The Executor does not have enough memory to cache and hence again
>> needs to read the entire 1.2GB data from hadoop into memory.  But since the
>> cache() is used, leads to lot of GC pause leading to slowness in task
>> completion. Each task started parallel and
>> completed in 4 seconds (more than 1 sec for GC).
>>
>> *Run 3: 1 executor with 6GB memory, 12 cores and logData is cached took
>> 10 seconds [ran3 image]*
>>
>>  The Executor has memory that can fit 4 RDD partitions into memory
>> but 5th RDD it has to read from Hadoop. 4 tasks are started parallel and
>> they completed in 0.3 seconds without GC. But the 5th task which has to
>> read RDD from disk is started after 4 seconds, and gets completed in 2
>> seconds. Analysing why the 5th task is not started parallel with other
>> tasks or at least why it is not started immediately after the other task
>> completion.
>>
>> *Run 4:* 1 executor with 16GB memory , 12 cores and logData is cached
>> took 0.3 seconds [ran4 image]
>>
>>  The executor has enough memory to cache all the 5 RDD. All 5 tasks
>> are started in parallel and gets completed within 0.3 seconds.
>>
>>
>> So Spark performs well when the entire input data is in memory, or none of
>> it is. In the case of some RDDs in memory and some read from disk, there is
>> a delay in scheduling the fifth task; is this expected behavior or a
>> possible bug?
>>
>>
>>
>> Thanks,
>> Prabhu Joseph
>>
>>
>>
>>
>


Re: How to integrate Spark with OpenCV?

2015-01-14 Thread Alonso Isidoro Roman
Thanks Jörn, sorry for the special character your name needs, I don't know
how to type it. I was thinking the same. Do you know anybody who has tried
this approach?
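
Here is a rough sketch of the pattern Jörn describes, with the actual OpenCV
call left as a stub (detectObjects is hypothetical; only the Spark side is
real, and the snippet assumes a spark-shell session where sc exists):

// placeholder for the real OpenCV detection; returns a count of detected objects
def detectObjects(bytes: Array[Byte]): Int = 0   // TODO: decode and classify with OpenCV here

// distribute the image files to the workers and run the detector per image
val images = sc.binaryFiles("hdfs:///astro/images/*.png")   // RDD[(path, PortableDataStream)]

val detections = images.mapPartitions { part =>
  // a real version would load the native OpenCV library once per partition here
  part.map { case (path, stream) => (path, detectObjects(stream.toArray())) }
}

// combine the per-image results on the driver
val totalObjects = detections.map(_._2).sum()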

Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos...
 -  Edsger Dijkstra

My favorite quotes (today):
If debugging is the process of removing software bugs, then programming
must be the process of putting ...
  - Edsger Dijkstra

If you pay peanuts you get monkeys


2015-01-14 12:50 GMT+01:00 Jörn Franke jornfra...@gmail.com:

 Basically, you have to think about how to split the data (for pictures
 this can be for instance 8x8 matrices) and use spark to distribute it to
 different workers which themselves call opencv with the data. Afterwards
 you need to combine all results again. It really depends on your image /
 video domain how you do this, so I can give you only general instructions.
 On 14 Jan 2015, 12:44, Alonso Isidoro Roman alons...@gmail.com
 wrote:

 I was wondering whether it could be possible to use Apache Spark and
 OpenCV in order to recognize stellar objects from the data sets provided by
 NASA, ESA and the other space agencies. I am asking for advice or
 impressions about it, not example code or anything else, but if there is
 any, please share it.

 Thanks again, and forgive me if this is off topic.

 Alonso

 Alonso Isidoro Roman.

 Mis citas preferidas (de hoy) :
 Si depurar es el proceso de quitar los errores de software, entonces
 programar debe ser el proceso de introducirlos...
  -  Edsger Dijkstra

 My favorite quotes (today):
 If debugging is the process of removing software bugs, then programming
 must be the process of putting ...
   - Edsger Dijkstra

 If you pay peanuts you get monkeys


 2015-01-14 10:49 GMT+01:00 Akhil Das ak...@sigmoidanalytics.com:

 Here's an example for detecting the face
 https://github.com/openreserach/bin2seq/blob/master/src/test/java/com/openresearchinc/hadoop/test/SparkTest.java#L190
 you might find it useful.

 Thanks
 Best Regards

 On Wed, Jan 14, 2015 at 3:06 PM, jishnu.prat...@wipro.com wrote:

  Hi Akhil

 Thanks for the response

 Our use case is object detection in multiple videos. It is essentially
 searching for an image in a video by matching the image against all
 the frames of the video. I am able to do it in normal Java code using the
 OpenCV lib now, but I don't think it is scalable to the extent that we could
 implement it for thousands of large videos. So I thought we could
 leverage the distributed computing and performance of Spark, *if possible*.

 I could see Jaonary Rabarisoa
 http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=user_nodesuser=340
 has tried to use OpenCV with spark
 http://apache-spark-user-list.1001560.n3.nabble.com/Getting-started-using-spark-for-computer-vision-and-video-analytics-td1551.html.
 But I don't have any code reference on how to do it with OpenCV.

 If any image + video processing library works better with Spark, please
 let me know. Any help would be really appreciated.

 .

 Thanks  Regards

 Jishnu Menath Prathap

 *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
 *Sent:* Wednesday, January 14, 2015 12:35 PM
 *To:* Jishnu Menath Prathap (WT01 - BAS)
 *Cc:* user@spark.apache.org
 *Subject:* Re: How to integrate Spark with OpenCV?



  I haven't played with OpenCV yet, but I was just wondering about your
  use-case. What exactly are you trying to do?


   Thanks

 Best Regards



 Jishnu Prathap jishnu.prat...@wipro.com wrote:



 Hi, can someone suggest any video + image processing library which works
 well with Spark? Currently I am trying to integrate OpenCV with Spark. I am
 relatively new to both Spark and OpenCV. It would really help me if someone
 could share some sample code showing how to use Mat, IplImage and Spark RDDs
 together. Any help would be really appreciated. Thanks in advance!!
  --

 View this message in context: How to integrate Spark with OpenCV?
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-integrate-Spark-with-OpenCV-tp21133.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.








Re: How to integrate Spark with OpenCV?

2015-01-14 Thread Alonso Isidoro Roman
I was wondering whether it could be possible to use Apache Spark and
OpenCV in order to recognize stellar objects from the data sets provided by
NASA, ESA and the other space agencies. I am asking for advice or
impressions about it, not example code or anything else, but if there is
any, please share it.

Thanks again, and forgive me if this is off topic.

Alonso

Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos...
 -  Edsger Dijkstra

My favorite quotes (today):
If debugging is the process of removing software bugs, then programming
must be the process of putting ...
  - Edsger Dijkstra

If you pay peanuts you get monkeys


2015-01-14 10:49 GMT+01:00 Akhil Das ak...@sigmoidanalytics.com:

 Here's an example for detecting the face
 https://github.com/openreserach/bin2seq/blob/master/src/test/java/com/openresearchinc/hadoop/test/SparkTest.java#L190
 you might find it useful.

 Thanks
 Best Regards

 On Wed, Jan 14, 2015 at 3:06 PM, jishnu.prat...@wipro.com wrote:

  Hi Akhil

 Thanks for the response

 Our use case is object detection in multiple videos. It is essentially
 searching for an image in a video by matching the image against all
 the frames of the video. I am able to do it in normal Java code using the
 OpenCV lib now, but I don't think it is scalable to the extent that we could
 implement it for thousands of large videos. So I thought we could
 leverage the distributed computing and performance of Spark, *if possible*.

 I could see Jaonary Rabarisoa
 http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=user_nodesuser=340
 has tried to use OpenCV with spark
 http://apache-spark-user-list.1001560.n3.nabble.com/Getting-started-using-spark-for-computer-vision-and-video-analytics-td1551.html.
 But I don't have any code reference on how to do it with OpenCV.

 If any image + video processing library works better with Spark, please
 let me know. Any help would be really appreciated.

 .

 Thanks  Regards

 Jishnu Menath Prathap

 *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
 *Sent:* Wednesday, January 14, 2015 12:35 PM
 *To:* Jishnu Menath Prathap (WT01 - BAS)
 *Cc:* user@spark.apache.org
 *Subject:* Re: How to integrate Spark with OpenCV?



  I haven't played with OpenCV yet, but I was just wondering about your
  use-case. What exactly are you trying to do?


   Thanks

 Best Regards



 Jishnu Prathap jishnu.prat...@wipro.com wrote:



 Hi, can someone suggest a video and image processing library that works
 well with Spark? Currently I am trying to integrate OpenCV with Spark. I am
 relatively new to both Spark and OpenCV. It would really help me if someone
 could share some sample code on how to use Mat, IplImage and Spark RDDs
 together. Any help would be really appreciated. Thanks in advance!!
  --

 View this message in context: How to integrate Spark with OpenCV?
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-integrate-Spark-with-OpenCV-tp21133.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.







Re: how to run spark function in a tomcat servlet

2014-12-10 Thread Alonso Isidoro Roman
Hi, I think this post
http://stackoverflow.com/questions/2681759/is-there-anyway-to-exclude-artifacts-inherited-from-a-parent-pom
should help you: the usual fix is to exclude the conflicting Jetty/servlet
artifacts from the spark-core dependency in your build.

Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos...
 -  Edsger Dijkstra

My favorite quotes (today):
If debugging is the process of removing software bugs, then programming
must be the process of putting ...
  - Edsger Dijkstra

If you pay peanuts you get monkeys


2014-12-10 17:19 GMT+01:00 bai阿蒙 smallmonkey...@hotmail.com:

 Hi guys:
 I want to call the RDD API in a servlet that runs in Tomcat, so I added
 spark-core.jar to the WEB-INF/lib of the web project and deployed it to
 Tomcat. But spark-core.jar brings in the HttpServlet classes that belong to
 Jetty, so there is a conflict. Can anybody tell me how to resolve it?
 Thanks a lot.

 baishuo



about a JavaWordCount example with spark-core_2.10-1.0.0.jar

2014-06-23 Thread Alonso Isidoro Roman
 MongoOutputFormat and config are relevant

save.saveAsNewAPIHadoopFile("file:/bogus", Object.class, Object.class,
    MongoOutputFormat.class, config);

}



}

It looks like a jar-hell dependency problem, doesn't it? Can anyone guide or
help me?

Another thing: I don't like closures; is it possible to use this framework
without using them?
Another question: are these objects, JavaSparkContext sc and
JavaPairRDD<Object, BSONObject> mongoRDD, thread safe? Can I use them as
singletons?

Thank you very much, and apologies if the questions are not exactly trending
topics :)
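
For reference, here is a minimal, hedged sketch of the pattern the fragment
above comes from: writing a JavaPairRDD to MongoDB through the mongo-hadoop
connector's MongoOutputFormat. The URI, database and collection names are
placeholders, and the surrounding class is invented purely for illustration:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BSONObject;
import org.bson.BasicBSONObject;

import com.mongodb.hadoop.MongoOutputFormat;

import scala.Tuple2;

public class MongoWriteSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("mongo-write-sketch"));

    // mongo-hadoop takes the target collection from the configuration,
    // not from the path passed to saveAsNewAPIHadoopFile.
    Configuration config = new Configuration();
    config.set("mongo.output.uri", "mongodb://localhost:27017/mydb.wordcount"); // placeholder

    JavaPairRDD<Object, BSONObject> save = sc.parallelize(Arrays.asList("spark", "mongo"))
        .mapToPair(word -> {
          BSONObject doc = new BasicBSONObject();
          doc.put("word", word);
          doc.put("count", 1);
          // A null key is a common pattern in mongo-hadoop examples; it leaves
          // _id generation to MongoDB.
          return new Tuple2<Object, BSONObject>(null, doc);
        });

    // The file path is effectively ignored by MongoOutputFormat, which is why
    // examples pass a bogus URI here.
    save.saveAsNewAPIHadoopFile("file:///bogus", Object.class, BSONObject.class,
        MongoOutputFormat.class, config);

    sc.stop();
  }
}

As for reusing these objects: a single JavaSparkContext per application is the
normal pattern, and RDDs are immutable descriptions of a computation, so
sharing them across threads is not a problem in itself.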

Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos...
 -  Edsger Dijkstra

My favorite quotes (today):
If debugging is the process of removing software bugs, then programming
must be the process of putting ...
  - Edsger Dijkstra

If you pay peanuts you get monkeys


Re: Beginners Hadoop question

2014-03-03 Thread Alonso Isidoro Roman
Hi, I am a beginner too, but as I have learned, Hadoop works better with
big files, at least 64 MB or 128 MB (the typical HDFS block size) or even
more. I think you need to aggregate all the files into one new big file.
Then you must copy it to HDFS using this command:

hadoop fs -put MYFILE /YOUR_ROUTE_ON_HDFS/MYFILE

hadoop fs -put just copies MYFILE into the Hadoop distributed file system (HDFS).
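
On the second question in the original mail (how the code has to change with
many smaller files): Spark itself can read a directory, a glob or a
comma-separated list of paths in a single textFile call, so the job barely
changes. A minimal sketch with placeholder HDFS paths:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadManyFilesSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("read-many-files"));

    // One big aggregated file copied with "hadoop fs -put":
    JavaRDD<String> big = sc.textFile("hdfs:///user/me/my-input.csv");

    // Or all the smaller files at once: a directory, a glob or a
    // comma-separated list of paths all work with textFile.
    JavaRDD<String> small = sc.textFile("hdfs:///user/me/input-dir/*.csv");

    System.out.println("lines in big file: " + big.count()
        + ", lines in small files: " + small.count());
    sc.stop();
  }
}

For very large numbers of tiny files, wholeTextFiles (which returns
(path, content) pairs) or aggregating them up front, as suggested above,
usually behaves better, because each small file otherwise becomes at least one
separate task.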

May I recommend what I have done? Go to BigDataUniversity.com and take
the Hadoop Fundamentals I course. It is free and very well documented.

Regards

Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos...
 -  Edsger Dijkstra

My favorite quotes (today):
If debugging is the process of removing software bugs, then programming
must be the process of putting ...
  - Edsger Dijkstra

If you pay peanuts you get monkeys



2014-03-03 12:10 GMT+01:00 goi cto goi@gmail.com:

 Hi,

 I am sorry for the beginner's question, but...
 I have Spark Java code which reads a file (c:\my-input.csv), processes it
 and writes an output file (my-output.csv).
 Now I want to run it on Hadoop in a distributed environment.
 1) Should my input file be one big file or separate smaller files?
 2) If we are using smaller files, how does my code need to change to
 process all of the input files?

 Will Hadoop just copy the files to different servers, or will it also split
 their content among the servers?

 Any example would be great!
 --
 Eran | CTO