Re: Write to Cassandra table from pyspark fails with scala reflect error

2016-09-14 Thread Russell Spitzer
Spark 2.0 defaults to Scala 2.11, so if you didn't build it yourself you need the 2.11 artifact for the Spark Cassandra Connector. On Wed, Sep 14, 2016 at 7:44 PM Trivedi Amit wrote: > Hi, > > > > I am testing a pyspark program that will read from a csv file and
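
In practice that means depending on the Scala 2.11 build of the connector. A minimal build.sbt sketch (the 2.0.0-M3 version is the one mentioned in the thread below; everything else here is an assumption), with the %% operator picking the _2.11 artifact automatically:

    // build.sbt -- sketch only; match the connector version to your Spark/Cassandra versions
    scalaVersion := "2.11.8"
    libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.0-M3"

For pyspark the equivalent is passing the spark-cassandra-connector_2.11 coordinate to --packages instead of the _2.10 one.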

Re: Write to Cassandra table from pyspark fails with scala reflect error

2016-09-14 Thread Trivedi Amit
Hi, I am testing a pyspark program that will read from a csv file and write data into a Cassandra table. I am using pyspark with spark-cassandra-connector 2.10:2.0.0-M3. I am using Spark v2.0.0. While executing the below command

Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread sanat kumar Patnaik
The performance I mentioned here is all local (my laptop). I have tried the same thing on a cluster (Elastic MapReduce) and have seen even worse results. Is there a way this can be done efficiently? Has any of you tried it? On Wednesday, September 14, 2016, Jörn Franke

Job Opportunity

2016-09-14 Thread Data Junkie
Looking for seasoned Apache Spark/Scala developers in the US - east coast or west coast. If interested, please mail your resume to apachesparkrecruitm...@gmail.com. No headhunters/outsourcing.

Re: Please assist: migrating RandomForestExample from MLLib to ML

2016-09-14 Thread Marco Mistroni
many thanks Sean! kr marco On Wed, Sep 14, 2016 at 10:33 PM, Sean Owen wrote: > If it helps, I've already updated that code for the 2nd edition, which > will be based on ~Spark 2.1: > > https://github.com/sryza/aas/blob/master/ch04-rdf/src/main/ >

Re: Please assist: migrating RandomForestExample from MLLib to ML

2016-09-14 Thread Sean Owen
If it helps, I've already updated that code for the 2nd edition, which will be based on ~Spark 2.1: https://github.com/sryza/aas/blob/master/ch04-rdf/src/main/scala/com/cloudera/datascience/rdf/RunRDF.scala#L220 This should be an equivalent working example that deals with categoricals via
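
Not the book's code, but a minimal sketch of the general spark.ml migration pattern the linked example illustrates: categorical features are handled through column metadata (e.g. VectorIndexer) instead of MLlib's categoricalFeaturesInfo map, and the forest becomes a Pipeline stage. Column names and data are placeholders:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.feature.VectorIndexer

    // data: DataFrame with "label" and "features" columns (hypothetical)
    val indexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(40)   // columns with <= 40 distinct values are treated as categorical

    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("indexedFeatures")
      .setNumTrees(20)
      .setImpurity("entropy")
      .setMaxDepth(30)
      .setMaxBins(300)

    val model = new Pipeline().setStages(Array(indexer, rf)).fit(data)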

Please assist: migrating RandomForestExample from MLLib to ML

2016-09-14 Thread Marco Mistroni
Hi all, I have been toying around with this well-known RandomForestExample code: val forest = RandomForest.trainClassifier( trainData, 7, Map(10 -> 4, 11 -> 40), 20, "auto", "entropy", 30, 300) This comes from this link (

Re: Streaming - lookup against reference data

2016-09-14 Thread Jörn Franke
Hmm, is it just a lookup and the values are small? I do not think that in this case Redis needs to be installed on each worker node. Redis has a rather efficient protocol, hence one or a few dedicated Redis nodes probably fit your purpose. Just try to reuse connections and do
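
A sketch of the connection-reuse idea, assuming the Jedis client, a hypothetical dedicated "redis-host", and a stream of String keys: one connection per partition rather than per record.

    import redis.clients.jedis.Jedis

    val enriched = stream.mapPartitions { records =>
      val jedis = new Jedis("redis-host", 6379)
      // materialise the lookups before closing the connection
      val out = records.map(key => (key, jedis.get(key))).toList
      jedis.close()
      out.iterator
    }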

Re: RMSE in ALS

2016-09-14 Thread Sean Owen
Yes, that's what TF-IDF is, but it's just a statistic and not a ranking. If you're using that to fill in a user-item matrix then that is your model; you don't need ALS. Building an ALS model on this is kind of like building a model on a model. Applying RMSE in this case is a little funny, given

Re: RMSE in ALS

2016-09-14 Thread Pasquinell Urbani
The implicit rankings are the output of TF-IDF. I.e.: each_ranking = frequency of an item * log(amount of total customers / amount of customers buying the item) On 14 Sept 2016 17:14, "Sean Owen" wrote: > What are implicit rankings here? > RMSE would not be an appropriate

Re: Not all KafkaReceivers processing the data Why?

2016-09-14 Thread Jeff Nadler
Sure the partitions exist, but is there data in all partitions? Try the kafka offset checker: kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zookeeper localhost:2181 -group -topic On Wed, Sep 14, 2016 at 1:00 PM, wrote: > Sure thanks I have

Re: RMSE in ALS

2016-09-14 Thread Sean Owen
What are implicit rankings here? RMSE would not be an appropriate measure for comparing rankings. There are ranking metrics like mean average precision that would be appropriate instead. On Wed, Sep 14, 2016 at 9:11 PM, Pasquinell Urbani < pasquinell.urb...@exalitica.com> wrote: > It was a typo
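
A sketch of evaluating rankings with spark.mllib's RankingMetrics instead of RMSE (the input RDD is hypothetical):

    import org.apache.spark.mllib.evaluation.RankingMetrics

    // predictedAndActual: RDD[(Array[Int], Array[Int])] pairing, per user, the
    // recommended item ids with the item ids the user actually bought
    val metrics = new RankingMetrics(predictedAndActual)
    println(s"MAP          = ${metrics.meanAveragePrecision}")
    println(s"precision@10 = ${metrics.precisionAt(10)}")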

Re: RMSE in ALS

2016-09-14 Thread Pasquinell Urbani
It was a typo, both are rmse. The frequency distribution of rankings is the following [inline image: ranking frequency distribution] As you can see, there is a heavy tail, but the majority of the observations lie near ranking 5. I'm working with implicit rankings (generated by TF-IDF), can this affect

Re: Not all KafkaReceivers processing the data Why?

2016-09-14 Thread Jeremy Smith
Take a look at how the messages are actually distributed across the partitions. If the message keys have a low cardinality, you might get poor distribution (i.e. all the messages are actually only in two of the five partitions, leading to what you see in Spark). If you take a look at the Kafka

Re: RMSE in ALS

2016-09-14 Thread Sean Owen
There is no way to answer this without knowing what your inputs are like. If they're on the scale of thousands, that's small (good). If they're on the scale of 1-5, that's extremely poor. What's RMS vs RMSE? On Wed, Sep 14, 2016 at 8:33 PM, Pasquinell Urbani

Re: Not all KafkaReceivers processing the data Why?

2016-09-14 Thread Jeff Nadler
Have you checked your Kafka brokers to be certain that data is going to all 5 partitions? We use something very similar (but in Scala) and have no problems. Also you might not get the best response blasting both user+dev lists like this. Normally you'd want to use 'user' only. -Jeff On

Not all KafkaReceivers processing the data Why?

2016-09-14 Thread Rachana Srivastava
Hello all, I have created a Kafka topic with 5 partitions and I am using the createStream receiver API like the following. But somehow only one receiver is getting the input data; the rest of the receivers are not processing anything. Can you please help? JavaPairDStream messages = null;
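
For reference, the usual receiver-based pattern is one createStream call per receiver and a union of the resulting streams. A sketch with placeholder hosts and topic names; whether each receiver actually gets data still depends on how the messages are spread across the partitions:

    import org.apache.spark.streaming.kafka.KafkaUtils

    val numReceivers = 5
    val streams = (1 to numReceivers).map { _ =>
      // topic -> number of consumer threads inside this receiver
      KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))
    }
    val unioned = ssc.union(streams)   // process the combined stream downstream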

RMSE in ALS

2016-09-14 Thread Pasquinell Urbani
Hi Community, I'm performing an ALS for retail product recommendation. Right now I'm reaching rms_test = 2.3 and rmse_test = 32.5. Is this too much in your experience? Is the transformation of the ranking values important for getting good errors? Thank you all. Pasquinell Urbani

LIVY VS Spark Job Server

2016-09-14 Thread SamyaMaiti
Hi Team, I am evaluating different ways to submit & monitor Spark jobs using REST interfaces. When to use Livy vs Spark Job Server? Regards, Sam -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/LIVY-VS-Spark-Job-Server-tp27722.html Sent from the Apache

CPU Consumption of spark process

2016-09-14 Thread شجاع الرحمن بیگ
Hi All, I have a system with 6 physical cores and each core has 8 hardware threads, resulting in 48 virtual cores. Following are the settings in the configuration files. *spark-env.sh* export SPARK_WORKER_CORES=1 *spark-defaults.conf* spark.driver.cores 1 spark.executor.cores 1 spark.cores.max 1 So

Streaming - lookup against reference data

2016-09-14 Thread Tom Davis
Hi all, Interested in patterns people use in the wild for lookup against reference data sets from a Spark streaming job. The reference dataset will be updated during the life of the job (although being 30mins out of date wouldn't be an issue, for example). So far I have come up with a few
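
One pattern that tends to come up, sketched below under the assumption that the reference data can be reloaded by a driver-side helper (loadReferenceData() is hypothetical): keep the lookup table in a broadcast variable and re-broadcast it from the driver when it is older than, say, 30 minutes.

    // loadReferenceData(): Map[String, String] -- hypothetical helper that reloads the reference set
    var refData = sc.broadcast(loadReferenceData())
    var lastLoad = System.currentTimeMillis()

    stream.foreachRDD { rdd =>
      if (System.currentTimeMillis() - lastLoad > 30 * 60 * 1000) {
        refData.unpersist()
        refData = sc.broadcast(loadReferenceData())
        lastLoad = System.currentTimeMillis()
      }
      val ref = refData   // capture the current broadcast for this batch
      val enriched = rdd.map(k => (k, ref.value.getOrElse(k, "unknown")))
      enriched.count()    // stand-in for the real downstream processing
    }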

Re: Spark Interview questions

2016-09-14 Thread Jacek Laskowski
Hi, Doh, Mich, it's way too much to ask for "typical Spark interview questions for Spark/Scala junior roles". There are plenty of such questions and I don't think there's a way to have them all noted down. Spark supports 5 languages, offers 4 modules + Core, and presents itself differently to

Best Practices for Spark-Python App Deployment

2016-09-14 Thread RK Aduri
Dear All: We are trying to deploy ( using Jenkins ) a spark-python app on an edge node, however the dilemma is whether to clone the git repo to all the nodes in the cluster. The reason is, if we choose to use the deployment mode as cluster and master as yarn, then driver expects the

Re: Spark kafka integration issues

2016-09-14 Thread Cody Koeninger
Yeah, an updated version of that blog post is available at https://github.com/koeninger/kafka-exactly-once On Wed, Sep 14, 2016 at 11:35 AM, Mukesh Jha wrote: > Thanks for the reply Cody. > > I found the below article on the same, very helpful. Thanks for the details, >

Re: Spark SQL - Applying transformation on a struct inside an array

2016-09-14 Thread Fred Reiss
+1 to this request. I talked last week with a product group within IBM that is struggling with the same issue. It's pretty common in data cleaning applications for data in the early stages to have nested lists or sets with inconsistent or incomplete schema information. Fred On Tue, Sep 13, 2016 at

The coming data on Spark Streaming

2016-09-14 Thread pcandido
Hi everyone, I'm starting with Spark Streaming and would like to know some things about arriving data. I know that SS uses micro-batches, and that they are received by workers and stored as RDDs. The master, at defined intervals, receives a pointer to the micro-batch RDD and can use it to process data using

Re: Reading the most recent text files created by Spark streaming

2016-09-14 Thread Jörn Franke
Hi, An alternative to Spark could be Flume to store data from Kafka to HDFS. It also provides some reliability mechanisms and has been explicitly designed for import/export and is tested. Not sure if I would go for Spark Streaming if the use case is only storing, but I do not have the full

Re: ACID transactions on data added from Spark not working

2016-09-14 Thread Mich Talebzadeh
Hi, I believe this is an issue with Spark handling transactional tables in Hive. When you add rows from Spark to an ORC transactional table, the Hive metadata tables HIVE_LOCKS and TXNS are not updated. This does not happen with Hive itself. As a result these new rows are left in an inconsistent

ACID transactions on data added from Spark not working

2016-09-14 Thread Jack Wenger
Hi there, I'm trying to use ACID transactions in Hive but I have a problem when the data are added with Spark. First, I created a table with the following statement: CREATE TABLE testdb.test(id string, col1
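
For context, an ACID table in Hive generally needs ORC storage, bucketing and the transactional property, with the transaction manager enabled on the metastore side. A hypothetical completed version of such a DDL (not the poster's exact statement, which is cut off above), issued through a HiveContext:

    // hiveContext is a hypothetical org.apache.spark.sql.hive.HiveContext
    hiveContext.sql("""
      CREATE TABLE testdb.test (id string, col1 string)
      CLUSTERED BY (id) INTO 4 BUCKETS
      STORED AS ORC
      TBLPROPERTIES ('transactional' = 'true')
    """)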

Re: Spark kafka integration issues

2016-09-14 Thread Mukesh Jha
Thanks for the reply Cody. I found the below article on the same, very helpful. Thanks for the details, much appreciated. http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/ On Tue, Sep 13, 2016 at 8:14 PM, Cody Koeninger wrote: > 1. see

Re: Add sqldriver.jar to Spark 1.6.0 executors

2016-09-14 Thread Marcelo Vanzin
Use: spark-submit --jars /path/sqldriver.jar --conf spark.driver.extraClassPath=sqldriver.jar --conf spark.executor.extraClassPath=sqldriver.jar In client mode the driver's classpath needs to point to the full path, not just the name. On Wed, Sep 14, 2016 at 5:42 AM, Kevin Tran

Reading the most recent text files created by Spark streaming

2016-09-14 Thread Mich Talebzadeh
Hi, I have a Spark Streaming job that reads messages/prices from Kafka and writes them as text files to HDFS. This is pretty efficient. Its only function is to persist the incoming messages to HDFS. This is what it does dstream.foreachRDD { pricesRDD => val x= pricesRDD.count //
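
A minimal sketch of that persist-only pattern (the output path is a placeholder), writing each non-empty batch to its own HDFS directory:

    dstream.foreachRDD { (pricesRDD, batchTime) =>
      if (!pricesRDD.isEmpty()) {
        pricesRDD.saveAsTextFile(s"hdfs:///prices/prices_${batchTime.milliseconds}")
      }
    }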

RE: RE: t it does not stop at breakpoints which is in an anonymous function

2016-09-14 Thread chen yong
Dear Dirceu, thank you again. Actually, I never saw it stop at the breakpoints no matter how long I wait. It just skipped the whole anonymous function to directly reach the first breakpoint immediately after the anonymous function body. Is that normal? I suspect something wrong in my

RE: how to specify cores and executor to run spark jobs simultaneously

2016-09-14 Thread 박경희
Hi Divya, I think you are trying to run Spark jobs on YARN, and that you would like to submit each job to a different queue on YARN. If so, you may need to prepare the queues on YARN by configuring the scheduler to run the Spark jobs. Best regards, KyeongHee - Original Message -

Re: Streaming Backpressure with Multiple Streams

2016-09-14 Thread Jeff Nadler
So as you were maybe thinking, it only happens with the combination: Direct Stream only + backpressure = works as expected 4x Receiver on Topic A + Direct Stream on Topic B + backpressure = the direct stream is throttled even in the absence of scheduling delay This is using Spark 1.5.0 on CDH.

RE: Anyone got a good solid example of integrating Spark and Solr

2016-09-14 Thread Adline Dsilva
Hi, Take a look at https://github.com/lucidworks/spark-solr . This supports authentication with a Kerberized Solr. Unfortunately this implementation only supports Solr 5.x+, and CDH has Solr 4.x. One option is to use Apache Solr 6.x with CDH. Regards, Adline Sent from

Re: Sqoop on Spark

2016-09-14 Thread Mich Talebzadeh
Sqoop is a standalone product (a utility) that is used to get data out of JDBC compliant database tables into HDFS and Hive if specified. Spark can also use JDBC to get data out from such tables. However, I have not come across a situation where Sqoop is invoked from Spark. Have a look at Sqoop
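
A sketch of the Spark JDBC route (connection details are hypothetical); partitionColumn/numPartitions give parallel reads much like Sqoop's split-by and -m options:

    val jdbcDF = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@dbhost:1521:ORCL")
      .option("dbtable", "scott.emp")
      .option("user", "scott")
      .option("password", "tiger")
      .option("partitionColumn", "empno")
      .option("lowerBound", "1")
      .option("upperBound", "10000")
      .option("numPartitions", "4")
      .load()

    jdbcDF.write.saveAsTable("hive_db.emp")   // or write out to HDFS as needed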

RE: t it does not stop at breakpoints which is in an anonymous function

2016-09-14 Thread chen yong
Thanks for your reply. You mean I have to insert some code, such as x.count or x.collect, between the original Spark code lines to trigger some operations, right? But where is the right place to put those lines? Felix From: Dirceu Semighini Filho

Re: unsubscribe

2016-09-14 Thread Daniel Lopes
Hi Chang, just send an e-mail to user-unsubscr...@spark.apache.org Best, *Daniel Lopes* Chief Data and Analytics Officer | OneMatch c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes www.onematch.com.br On Tue,

Re: t it does not stop at breakpoints which is in an anonymous function

2016-09-14 Thread Dirceu Semighini Filho
Hello Felix, Spark transformations are lazily evaluated, and that's why it doesn't stop at those breakpoints. They will be executed only when you call some action on your dataframe/rdd, like count, collect, ... Regards, Dirceu 2016-09-14 11:26 GMT-03:00 chen yong : > Hi all, > > > > I am
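
A tiny illustration (rdd is any RDD):

    val doubled = rdd.map { x =>
      x * 2        // a breakpoint here is not hit yet - map only records the lineage
    }
    doubled.count()  // the action triggers execution; only now is the breakpoint reached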

Sqoop on Spark

2016-09-14 Thread KhajaAsmath Mohammed
Hi Experts, Good morning. I am looking for some references on how to use sqoop with spark. could you please let me know if there are any references on how to use it. Thanks, Asmath.

t it does not stop at breakpoints which is in an anonymous function

2016-09-14 Thread chen yong
Hi all, I am a newbie to Spark. I am learning Spark by debugging the Spark code. It is strange to me that it does not stop at breakpoints that are in an anonymous function; it is normal in an ordinary function, though. Is that normal? How do I observe variables in an anonymous function? Please

Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread Jörn Franke
It could be that by using the rdd it converts the data from the internal format to Java objects (-> much more memory is needed), which may lead to spill over to disk. This conversion takes a lot of time. Then, you need to transfer these Java objects via network to one single node (repartition
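
One way to stay on the DataFrame writer for delimited text in 1.6 is the spark-csv package (a sketch, assuming com.databricks:spark-csv is on the classpath and "|" is the delimiter the downstream systems want); coalesce to a single file only if downstream really needs one, since that funnels everything through one task:

    myDF.write
      .format("com.databricks.spark.csv")
      .option("delimiter", "|")
      .save("output/text")

    // only if a single output file is genuinely required:
    // myDF.coalesce(1).write.format("com.databricks.spark.csv").option("delimiter", "|").save("output/single")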

Re: Spark Interview questions

2016-09-14 Thread Mich Talebzadeh
Hi Ashok, I am sure we all have some war stories, some of which I recall: 1. What is meant by RDD, DataFrame and Dataset? 2. What is meant by "All transformations in Spark are lazy"? 3. What are the two types of operations supported by RDD? 4. What is meant by Spark running under

Error casting from data frame to case class object

2016-09-14 Thread franz_butterbaum
I'm exploring features of Spark 2.0, and am trying to load a simple csv file into a dataset. These are the contents of my file named people.csv: name,age,occupation John,21,student Mark,33,analyst Susan,27,scientist Below is my code: import org.apache.spark.sql._ val spark =
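
Not necessarily the poster's exact code (which is cut off here), but a minimal Spark 2.0 sketch that usually works for this kind of file; a common gotcha is the case class field types not matching the inferred schema:

    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Int, occupation: String)

    val spark = SparkSession.builder().appName("csv-to-ds").getOrCreate()
    import spark.implicits._

    val people = spark.read
      .option("header", "true")
      .option("inferSchema", "true")   // without this every column is a string and .as[Person] fails
      .csv("people.csv")
      .as[Person]

    people.show()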

Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread Mich Talebzadeh
As I understand, you cannot deliver a json file downstream as they want text format. If it is batch processing, what is the window of delivery within the SLA? To write a 3GB file in 160 seconds means that it takes > 50 seconds to write 1 Gig, which looks like a long time to me. Even taking one minute

Add sqldriver.jar to Spark 1.6.0 executors

2016-09-14 Thread Kevin Tran
Hi Everyone, I tried in cluster mode on YARN * spark-submit --jars /path/sqldriver.jar * --driver-class-path * spark-env.sh SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/path/*" * spark-defaults.conf spark.driver.extraClassPath spark.executor.extraClassPath None of them works for me ! Does

Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread Jörn Franke
Hi, DataFrames are more efficient if you have Tungsten activated as the underlying processing engine (normally by default). However, this only speeds up processing; saving, as an io-bound operation, does not necessarily benefit. What is exactly slow? The write? You could use myDF.write.save().write...

Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread sanat kumar Patnaik
These are not csv files; they are utf8 files with a specific delimiter. I tried this out with a file (3 GB): myDF.write.json("output/myJson") Time taken - 60 secs approximately. myDF.rdd.repartition(1).saveAsTextFile("output/text") Time taken - 160 secs. That is where I am concerned, the time to write a text

Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread Mich Talebzadeh
These intermediate files - what sort of files are they? Are they csv-type files? I agree that a DF is more efficient than an RDD as it follows a tabular format (I assume that is what you mean by "columnar" format). So if you read these files in a batch process you may not worry too much about execution

Anyone got a good solid example of integrating Spark and Solr

2016-09-14 Thread Nkechi Achara
Hi All, I am trying to find some good examples on how to implement Spark and Solr and coming up blank. Basically the implementation of spark-solr does not seem to work correctly with CDH 552 (*1.5.x branch), where I am receiving various issues relating to dependencies, which I have not been fully

Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread sanat kumar Patnaik
Hi All, - I am writing a batch application using Spark SQL and Dataframes. This application has a bunch of file joins and there are intermediate points where I need to drop a file for downstream applications to consume. - The problem is all these downstream applications are still on

Spark Interview questions

2016-09-14 Thread Ashok Kumar
Hi, As a learner I would appreciate it if you could forward me typical Spark interview questions for Spark/Scala junior roles. I will be very obliged

Re: Spark Streaming - dividing DStream into mini batches

2016-09-14 Thread Daan Debie
Thanks for the awesome explanation! It's super clear to me now :) On Tue, Sep 13, 2016 at 4:42 PM, Cody Koeninger wrote: > The DStream implementation decides how to produce an RDD for a time > (this is the compute method) > > The RDD implementation decides how to partition

Re: Spark stalling during shuffle (maybe a memory issue)

2016-09-14 Thread bogdanbaraila
Hello Jonathan, Did you find any working solution to your issue? If yes, could you please share it? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-stalling-during-shuffle-maybe-a-memory-issue-tp6067p27716.html Sent from the Apache Spark User

Re: Spark SQL Thriftserver

2016-09-14 Thread Mich Talebzadeh
Actually this is what it says Connecting to jdbc:hive2://rhes564:10055 Connected to: Spark SQL (version 2.0.0) Driver: Hive JDBC (version 1.2.1.spark2) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline version 1.2.1.spark2 by Apache Hive So it uses Spark SQL. However, they do not seem

Re: how to specify cores and executor to run spark jobs simultaneously

2016-09-14 Thread Deepak Sharma
I am not sure about EMR, but it seems multi-tenancy is not enabled in your case. Multi-tenancy means all the applications have to be submitted to different queues. Thanks Deepak On Wed, Sep 14, 2016 at 11:37 AM, Divya Gehlot wrote: > Hi, > > I am on EMR cluster and My

how to specify cores and executor to run spark jobs simultaneously

2016-09-14 Thread Divya Gehlot
Hi, I am on an EMR cluster and my cluster configuration is as below: Number of nodes including master node - 3, Memory: 22.50 GB, VCores Total: 16, Active Nodes: 2, Spark version - 1.6.1. Parameters set in spark-defaults.conf: spark.executor.instances 2 > spark.executor.cores 8 >