Re: installation of spark

2019-06-05 Thread Alonso Isidoro Roman
> n simple way to configure this software? For instance, an all-in-one configuration file? It takes forever for me to configure things before I can really use it for data analysis.
>
> I hope my questions make sense. Thank you very much.
>
> Best regards,
>
> YA

Re: write files of a specific size

2019-05-05 Thread Alonso Isidoro Roman
>> Hi All,
>> My Spark SQL job produces output using the default partitioning and creates N files.
>> I want each file in the final result to be 100 MB in size.
>>
>> How can I do it?
>>
>> Thanks,
>> rajat
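One common way to approach the question quoted above is to estimate the total output size and derive a partition count from the target file size. The sketch below is a plain-Python illustration of that arithmetic only; the function name `partitions_for_target_size` and the sample sizes are assumptions made here, not something from the thread. The resulting count would typically be passed to something like `df.repartition(n)` before writing, and real file sizes will still vary with compression and data skew.

```python
import math

MB = 1024 * 1024

def partitions_for_target_size(total_bytes: int, target_file_mb: int = 100) -> int:
    """Return how many output files are needed so each is roughly target_file_mb."""
    if total_bytes <= 0:
        return 1  # still write a single (possibly empty) file
    # Round up so that no file has to exceed the target size.
    return max(1, math.ceil(total_bytes / (target_file_mb * MB)))

# Hypothetical example: about 1.5 GB of estimated output.
n = partitions_for_target_size(1536 * MB)
print(n)
```

The estimate of the total size is the weak point of this approach; it usually has to come from a previous run or from the size of the input.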

Re: [Spark2.1] SparkStreaming to Cassandra performance problem

2018-05-21 Thread Alonso Isidoro Roman
>>>> andra:0.7.0,org.apache.spark:spark-core_2.11:1.5.2 --conf spark.cassandra.connection.host='localhost' --num-executors 2 --executor-cores 2 SensorDataStreamHandler.py localhost:6667 test_topic2
>>>> ##
>>>>
>>>> # Run

Re: is it possible to create one KafkaDirectStream (Dstream) per topic?

2018-05-21 Thread Alonso Isidoro Roman
> e KafkaDirectStream (DStream) per topic within the same JVM, i.e. using only one SparkContext?
>
> Thanks!
-- Alonso Isidoro Roman about.me/alonso.isidoro.roman <https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

Re: Testing Spark-Cassandra

2018-01-17 Thread Alonso Isidoro Roman
> there any util to make unit testing easy, or which would be the best way to do it? A library? Cassandra with Docker?
-- Alonso Isidoro Roman

Re: Multiple Kafka topics processing in Spark 2.2

2017-09-06 Thread Alonso Isidoro Roman
> end the messages in multiple topics through the console producer, for example. But when Spark receives a message, how will it know which topic the message came from? Thanks a lot for any help!
>
> Cheers,
> Dan
-- Alonso Isidoro Roman

Re: Need some help around a Spark Error

2017-07-26 Thread Alonso Isidoro Roman
I hope this helps: https://stackoverflow.com/questions/40623957/slave-lost-and-very-slow-join-in-spark Alonso Isidoro Roman 2017

Re: how to identify the alive master spark via Zookeeper ?

2017-07-17 Thread Alonso Isidoro Roman
e-command-line> ... Alonso Isidoro Roman 2017-07-17 14:21 GMT+02:00 <marina.bru...@orange.com>:
> Hello,
>

Re: Need Spark(Scala) Performance Tuning tips

2017-06-09 Thread Alonso Isidoro Roman
, Jacek's <https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tuning.html> Alonso Isidoro Roman 2017-06-09

Re: Java SPI jar reload in Spark

2017-06-06 Thread Alonso Isidoro Roman
Hi, a quick Google search turned up this: https://github.com/spark-jobserver/spark-jobserver/issues/130 Alonso Isidoro Roman 2017-06-06 12:

Re: Spark 2.1 - Inferring schema of dataframe after reading json files not during

2017-06-02 Thread Alonso Isidoro Roman
where schema_parquet can be something like this: {"type" : "struct","fields" : [ {"name" : "column0","type" : "string","nullable" : false},{"name":"column1", "type":"string", "
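As a rough illustration of the JSON layout shown above, the following plain-Python sketch builds a complete schema document of the same shape and inspects it. The `nullable` value of the second field, the empty `metadata` entries, and the helper name `field_names` are assumptions, since the original snippet is cut off. In PySpark, a dictionary like this could then be handed to `StructType.fromJson`.

```python
import json

# Hypothetical complete schema in the same shape as the truncated snippet.
schema_json = """
{"type": "struct",
 "fields": [
   {"name": "column0", "type": "string", "nullable": false, "metadata": {}},
   {"name": "column1", "type": "string", "nullable": true, "metadata": {}}
 ]}
"""

def field_names(schema: dict) -> list:
    """Return the column names declared in a struct-style schema dictionary."""
    if schema.get("type") != "struct":
        raise ValueError("expected a schema of type 'struct'")
    return [field["name"] for field in schema["fields"]]

schema = json.loads(schema_json)
print(field_names(schema))
```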

Re: Worker node log not showed

2017-05-31 Thread Alonso Isidoro Roman
Are you running the code with YARN? If so, find the application ID in the web UI, then run the following command: yarn logs -applicationId your_application_id Alonso Isidoro Roman

Re: Spark Streaming: Custom Receiver OOM consistently

2017-05-21 Thread Alonso Isidoro Roman
Could you share the code? Alonso Isidoro Roman 2017-05-20 7:54 GMT+02:00 Manish Malhotra <manish.malhotra.w...@gmail.com>

Re: How to read large size files from a directory ?

2017-05-09 Thread Alonso Isidoro Roman
Please create a GitHub repo and upload the code there... Alonso Isidoro Roman 2017-05-09 8:47 GMT+02:00 ashwini anand

Re: Has anyone used CoreNLP from stanford for sentiment analysis in Spark? It does not work as desired for me.

2017-04-28 Thread Alonso Isidoro Roman
I forked a Twitter analyzer some time ago, but I think it is best to provide the original link <https://github.com/vspiewak/twitter-sentiment-analysis>. If you want, you can take a look at my fork <https://github.com/alonsoir/twitter-sentiment-analysis>. Regards Alonso Isidoro

Re: Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread Alonso Isidoro Roman
Forgive my ignorance, but what does HAR mean? Is it an acronym for High Available Record? Thanks Alonso Isidoro Roman 2017-04-20 10:

Re: Any NLP library for sentiment analysis in Spark?

2017-04-12 Thread Alonso Isidoro Roman
I forked a project some time ago; maybe you can use it: https://github.com/alonsoir/SparkTwitterAnalyzer Alonso Isidoro Roman

Re: Any NLP library for sentiment analysis in Spark?

2017-04-11 Thread Alonso Isidoro Roman
I have not used it yet, but this library looks promising: https://github.com/databricks/spark-corenlp Alonso Isidoro Roman 2017

Re: Benchmarking streaming frameworks

2017-04-03 Thread Alonso Isidoro Roman
I remember that Yahoo did something similar... https://github.com/yahoo/streaming-benchmarks Alonso Isidoro Roman 2017-04-0

Re: How best we can store streaming data on dashboards for real time user experience?

2017-03-30 Thread Alonso Isidoro Roman
If you want, you can check this link <https://github.com/alonsoir/twitter-sentiment-analysis>: Elasticsearch, Kibana and Spark working together. Alonso Isidoro Roman

Re: How best we can store streaming data on dashboards for real time user experience?

2017-03-30 Thread Alonso Isidoro Roman
-analytics/ https://speakerdeck.com/elasticsearch/using-elasticsearch-logstash-and-kibana-to-create-realtime-dashboards https://www.youtube.com/watch?v=PuvHINcU9DI then take a look at https://kudu.apache.org/ Tell us later what you think. Alonso Isidoro Roman

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Alonso Isidoro Roman
"Using Spark to query the data in the backend of the web UI?" Don't do that. I would recommend that the Spark Streaming process store the data in a NoSQL or SQL database, and that the web UI query the data from that database. Alonso Isidoro Roman

Re: What is the difference between mini-batch vs real time streaming in practice (not theory)?

2016-09-27 Thread Alonso Isidoro Roman
Mini-batch or near real time: processing frames within 500 ms or more. Real time: processing frames in 5-10 ms. The main difference is processing latency, I think. Apache Spark Streaming is mini-batch, not true real time. Alonso Isidoro Roman
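To make the distinction concrete, here is a toy plain-Python sketch of the mini-batch idea: events are collected into fixed intervals (500 ms, the figure mentioned above) and each group is processed together, whereas a record-at-a-time system would handle each event as it arrives. The function name and the sample events are made up for illustration; this is not how Spark Streaming is implemented internally.

```python
from collections import defaultdict

def micro_batches(events, batch_interval_ms=500):
    """Group (timestamp_ms, value) events into fixed-interval batches."""
    batches = defaultdict(list)
    for ts_ms, value in events:
        # Every event falling into the same interval joins the same batch.
        batches[ts_ms // batch_interval_ms].append(value)
    # Return the batches in time order.
    return [batches[key] for key in sorted(batches)]

# Hypothetical events: (arrival time in ms, payload)
events = [(0, "a"), (120, "b"), (499, "c"), (500, "d"), (1300, "e")]
print(micro_batches(events))
```

With a 500 ms interval, an event that arrives at the start of a window waits for the rest of the window before it is processed, which is where the extra latency of mini-batching comes from.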

Re: Extract timestamp from Kafka message

2016-09-26 Thread Alonso Isidoro Roman
Hmm, I think you have to embed the timestamp within the message... Alonso Isidoro Roman 2016-09-26 0:59 GMT+02:00 Kevin Tran
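A minimal sketch of that suggestion, assuming the messages are JSON strings: the producer wraps the payload together with a timestamp, and the consumer reads it back out. The field names `timestamp_ms` and `payload` and both function names are hypothetical choices for this example, not part of any Kafka API.

```python
import json
import time

def wrap_with_timestamp(payload: dict, now_ms=None) -> str:
    """Producer side: serialize the payload together with a creation timestamp."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return json.dumps({"timestamp_ms": now_ms, "payload": payload})

def extract_timestamp(message: str) -> int:
    """Consumer side: read the embedded timestamp back out of the message."""
    return json.loads(message)["timestamp_ms"]

message = wrap_with_timestamp({"sensor": "s1", "value": 21.5})
print(extract_timestamp(message))
```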

Re: Small files

2016-09-12 Thread Alonso Isidoro Roman
you will still have the small-files problem anyway. You have to create an uber file before you upload it to HDFS. Regards Alonso Isidoro Roman

Re: Small files

2016-09-12 Thread Alonso Isidoro Roman
, tell us something about this issue Alonso Alonso Isidoro Roman 2016-09-12 12:39 GMT+02:00 ayan guha <guha.a...@gmail.com>:

Re: Real Time Recommendation Engines with Spark and Scala

2016-09-05 Thread Alonso Isidoro Roman
By the way, I would love to work on your project; it looks promising! Alonso Isidoro Roman 2016-09-05 16:57 GMT+02:00 Alonso I

Re: Real Time Recommendation Engines with Spark and Scala

2016-09-05 Thread Alonso Isidoro Roman
the Scala shell to create this Unix command. Regards! Alonso Alonso Isidoro Roman 2016-09-05 16:39 GMT+02:00 Mich Talebzadeh <

Re: Real Time Recommendation Engines with Spark and Scala

2016-09-05 Thread Alonso Isidoro Roman
Hi Mich, I wrote a tiny project a few months ago with this issue in mind. The idea is to apply the ALS algorithm in order to get valid recommendations from other users. The URL of the project: <https://github.com/alonsoir/awesome-recommendation-engine> Alonso Isidoro Roman

Re: Design patterns involving Spark

2016-08-30 Thread Alonso Isidoro Roman
Thanks Mich, I will check it. Cheers Alonso Isidoro Roman 2016-08-30 9:52 GMT+02:00 Mich Talebzadeh <mich.talebza...@

Re: Design patterns involving Spark

2016-08-30 Thread Alonso Isidoro Roman
HBase for real-time queries? HBase was designed with batch in mind. Impala should be a better choice, but I do not know what Druid can do. Cheers Alonso Isidoro Roman

Re: Can I control the execution of Spark jobs?

2016-06-16 Thread Alonso Isidoro Roman
Hi Wang, maybe you could consider using an integration framework like Apache Camel to run the different jobs... Alonso Isidoro Roman

Re: Should I avoid "state" in a Spark application?

2016-06-13 Thread Alonso Isidoro Roman
Hi Haopu, please check these threads: http://stackoverflow.com/questions/24331815/spark-streaming-historical-state https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/total.html Alonso Isidoro Roman

Re: Spark Getting data from MongoDB in JAVA

2016-06-10 Thread Alonso Isidoro Roman
Why is the *spark-mongodb_2.11* dependency written twice in pom.xml? Alonso Isidoro Roman 2016-06-10 11:39 GMT+02:00 Asfandyar Ashraf

Re: About a problem running a spark job in a cdh-5.7.0 vmware image.

2016-06-06 Thread Alonso Isidoro Roman
-count]$ free
             total       used       free     shared    buffers     cached
Mem:       8061104    6687044    1374060       3464       5796     484416
-/+ buffers/cache:    6196832    1864272
Swap:      8388604     687500    7701104
so I can only use about 1 GB at most... Alonso Isidoro Roman

Re: About a problem running a spark job in a cdh-5.7.0 vmware image.

2016-06-06 Thread Alonso Isidoro Roman
uickstart ~]$ I know that sbt-launch is the sbt command running in another terminal, but shouldn't the NameNode and DataNode processes appear? Thank you very much for reading this far. Alonso Isidoro Roman

Re: About a problem running a spark job in a cdh-5.7.0 vmware image.

2016-06-04 Thread Alonso Isidoro Roman
) at example.spark.AmazonKafkaConnector.main(AmazonKafkaConnectorWithMongo.scala) Alonso Isidoro Roman 2016-06-03 18:23 GMT+02:00

Re: About a problem running a spark job in a cdh-5.7.0 vmware image.

2016-06-03 Thread Alonso Isidoro Roman
Thank you David. So I would have to change the way I create the SparkConf object, wouldn't I? I can see in this link that the way to run a Spark job using YARN is using this

Re: About a problem when mapping a file located within a HDFS vmware cdh-5.7 image

2016-06-01 Thread Alonso Isidoro Roman
Thank you David, I will try to follow your advice. Alonso Isidoro Roman 2016-05-31 21:28 GMT+02:00 David Newberger <dav

Re: About a problem when mapping a file located within a HDFS vmware cdh-5.7 image

2016-05-31 Thread Alonso Isidoro Roman
Hi David, the one from the develop branch. I think it should be the same, but I am actually not sure... Regards Alonso Isidoro Roman 2

Re: Spark + Kafka processing trouble

2016-05-31 Thread Alonso Isidoro Roman
Mich's idea is quite good; if I were you, I would follow it... Alonso Isidoro Roman 2016-05-31 6:37 GMT+02:00 Mich Tale

Re: Logistic Regression in Spark Streaming

2016-05-27 Thread Alonso Isidoro Roman
I do not have any experience using LR in Spark, but LR is already implemented in MLlib: http://spark.apache.org/docs/latest/mllib-linear-methods.html Alonso Isidoro Roman

Re: about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-26 Thread Alonso Isidoro Roman
Thank you Cody, I will try to follow your advice. Alonso Isidoro Roman 2016-05-26 17:00 GMT+02:00 Cody Koeninger <c...@koen

Re: about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-26 Thread Alonso Isidoro Roman
ng data from kafka topic...")
}
//println("jsonParsed is " + jsonParsed)
//The idea is to save results from Recommender.predict within mongodb, so i will have to deal with this issue
//after resolving the issue of .foreachRDD(_.foreachPartition(recommende

Re: about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-25 Thread Alonso Isidoro Roman
ordered by rating (higher first) and only keep the first NumRecommendations
val myUserId = userDict.getIndex(MyUsername)
val recommendations = model.predict(candidates.map((myUserId, _))).collect
val endAls = DateTime.now
val result = recommendations.sortBy(-_.rating).take(NumRecommenda

about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-25 Thread Alonso Isidoro Roman
y not, what about using avro or parquet?
producer.send(Json.toJson(amazonRating).toString)
//producer.send(amazonRating)
println("amazon product with rating sent to kafka cluster..." + amazonRating.toString)
System.exit(0)
}
}
}
I have written a

Re: Spark JOIN Not working

2016-05-24 Thread Alonso Isidoro Roman
Could you share a bit of the dataset? It is difficult to test without data... Alonso Isidoro Roman 2016-05-24 8:43 GMT+02:00 Aakas

Re: [Spark 1.5.2]Check Foreign Key constraint

2016-05-11 Thread Alonso Isidoro Roman
I think that Impala and Hive have this feature. Impala is faster than Hive; Hive is probably better suited to batch mode. Alonso Isidoro Roman

Re: Multiple Spark Applications that use Cassandra, how to share resources/nodes

2016-05-04 Thread Alonso Isidoro Roman
Alonso Isidoro Roman. Mis citas preferidas (de hoy) : "Si depurar es el proceso de quitar los errores de software, entonces programar debe ser el proceso de introducirlos..." - Edsger Dijkstra My favorite quotes (today): "If debugging is the process of removin

Re: migration from Teradata to Spark SQL

2016-05-04 Thread Alonso Isidoro Roman
I agree with Deepak, and I would try saving the data in both Parquet and Avro format. If you can, measure the performance and choose the best; it will probably be Parquet, but you have to find out for yourself. Alonso Isidoro Roman.

Re: Code optimization

2016-04-19 Thread Alonso Isidoro Roman
Hi Angel, how about storing k.filter(k("WT_ID") in a val? I think you can avoid that, and do not forget to use System.nanoTime to measure the gain... Alonso Isidoro Roman.
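The timing advice above refers to the JVM's `System.nanoTime`. As a hedged illustration of the same before-and-after measurement, here is a plain-Python sketch that uses `time.perf_counter_ns` (Python 3.7+) as a stand-in; the helper name `timed` and the workload are assumptions for this example only.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed nanoseconds)."""
    start = time.perf_counter_ns()
    result = fn(*args, **kwargs)
    elapsed_ns = time.perf_counter_ns() - start
    return result, elapsed_ns

# Hypothetical workload standing in for the code being optimized.
result, elapsed_ns = timed(sum, range(1_000_000))
print(f"result={result}, elapsed={elapsed_ns / 1e6:.3f} ms")
```

A single run is noisy, so it is usually worth repeating the measurement several times for each variant before comparing them.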

Re: Run a self-contained Spark app on a Spark standalone cluster

2016-04-12 Thread Alonso Isidoro Roman
I don't know how to do it with Python, but Scala has a plugin named sbt-pack that creates a self-contained Unix command from your code, with no need to use spark-submit. There should be something similar to this tool out there. Alonso Isidoro Roman.

Re: Problem with jackson lib running on spark

2016-03-31 Thread Alonso Isidoro Roman
> g.apache.derby:derbyclient:jar:10.11.1.1:compile
> [INFO] | +- com.google.inject:guice:jar:4.0:compile
> [INFO] | | \- aopalliance:aopalliance:jar:1.0:compile
> [INFO] | +- com.google.inject.extensions:guice-servlet:jar:4.0:compile
> [INFO] | +- com.google.inject.extensions:guice

Re: Problem with jackson lib running on spark

2016-03-31 Thread Alonso Isidoro Roman
Run mvn dependency:tree and post the output here; I suspect that the Jackson library is pulled in by more than one dependency. Alonso Isidoro Roman.

What version of twitter4j should I use with Spark Streaming? UPDATING thread

2016-03-01 Thread Alonso Isidoro Roman
oy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
MacBook-Pro-Retina-de-Alonso:spark-1.6 aironman$
Do I have to update the jar versions? I see that the current version of twitter4j is 4.0.4... Thanks Alonso Isidoro Roman.

Re: Spark Job Hanging on Join

2016-02-23 Thread Alonso Isidoro Roman
Thanks for sharing the know-how, guys. Alonso Isidoro Roman.

Re: Stored proc with spark

2016-02-16 Thread Alonso Isidoro Roman
relational databases? What about Sqoop? https://en.wikipedia.org/wiki/Sqoop Alonso Isidoro Roman.

Re: Spark job does not perform well when some RDD in memory and some on Disk

2016-02-04 Thread Alonso Isidoro Roman
"But learned that it is better not to reduce it to 0." Could you explain this sentence a bit more? Thanks Alonso Isidoro Roman.

Re: How to integrate Spark with OpenCV?

2015-01-14 Thread Alonso Isidoro Roman
Thanks Jorn, and sorry about the special character your name needs; I don't know how to type it. I was thinking the same. Do you know of anybody who has tried this approach? Alonso Isidoro Roman.

Re: How to integrate Spark with OpenCV?

2015-01-14 Thread Alonso Isidoro Roman
, please provide. Thanks again and forgive the possible off topic. Alonso Alonso Isidoro Roman.

Re: how to run spark function in a tomcat servlet

2014-12-10 Thread Alonso Isidoro Roman
Hi, I think this post http://stackoverflow.com/questions/2681759/is-there-anyway-to-exclude-artifacts-inherited-from-a-parent-pom should help you. Alonso Isidoro Roman.

about a JavaWordCount example with spark-core_2.10-1.0.0.jar

2014-06-23 Thread Alonso Isidoro Roman
, JavaPairRDD<Object, BSONObject> mongoRDD thread-safe? Can I use them as singletons? Thank you very much, and apologies if the questions are not a trending topic :) Alonso Isidoro Roman.

Re: Beginners Hadoop question

2014-03-03 Thread Alonso Isidoro Roman
copy MYFILE into the Hadoop Distributed File System. May I recommend what I did? Go to BigDataUniversity.com and take the Hadoop Fundamentals I course. It is free and very well documented. Regards Alonso Isidoro Roman.