Re: Reading multiple json files from nested folders for data frame

2016-07-20 Thread Ashutosh Kumar
That example points to a particular json file. Will it work the same way if I point to a top-level folder containing all the json files? On Thu, Jul 21, 2016 at 12:04 PM, Simone wrote: > Yes you can - have a look here > http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets > > Hope
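For example, the JSON reader can be pointed at a glob over the date-wise folders rather than at a single file - a minimal sketch, assuming Spark 1.6, an existing sqlContext, and a hypothetical layout like /data/2016-07-20/*.json:

    // read every json file under every date folder into one DataFrame
    val df = sqlContext.read.json("/data/*/*.json")   // the path and layout are assumptions
    df.registerTempTable("events")
    sqlContext.sql("SELECT count(*) FROM events").show()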

Re: run spark apps in linux crontab

2016-07-20 Thread Chanh Le
If you use >, only stdout (print or println) is written to the log file; the other log output (INFO, WARN, ERROR) I believe is not written to the log file. But tee can do that. The following command (with the help of the tee command) writes the output both to the screen (stdout) and to the file. T

Reply: Re: run spark apps in linux crontab

2016-07-20 Thread luohui20001
got it. The difference: > : all messages go to the log file, leaving no messages on STDOUT; tee: all messages go to the log file and STDOUT at the same time. Thanks & Best regards! San.Luo - Original Message - From: Chanh Le To: luohui20...@sina.com Cc: focus , user Subject

Re: Role-based S3 access outside of EMR

2016-07-20 Thread Gourav Sengupta
But that would mean you would be accessing data over the internet, increasing data read latency and the risk of data transmission failures. Why are you not using EMR? Regards, Gourav On Thu, Jul 21, 2016 at 1:06 AM, Everett Anderson wrote: > Thanks, Andy. > > I am indeed often doing something similar, now -- copyi

Re: Reading multiple json files from nested folders for data frame

2016-07-20 Thread Ashutosh Kumar
There is no database. I read files from Google Cloud Storage / S3 / HDFS. Thanks Ashutosh On Thu, Jul 21, 2016 at 11:50 AM, Sree Eedupuganti wrote: > Database you are using ? >

Reading multiple json files from nested folders for data frame

2016-07-20 Thread Ashutosh Kumar
I need to read a bunch of json files kept in date-wise folders and perform sql queries on them using a data frame. Is it possible to do so? Please provide some pointers. Thanks Ashutosh

Optimal Amount of Tasks Per size of data in memory

2016-07-20 Thread Brandon White
What is the best heuristic for setting the number of partitions/tasks on an RDD based on the size of the RDD in memory? The Spark docs say that the number of partitions/tasks should be 2-3x the number of CPU cores, but this does not make sense for all data sizes. Sometimes, this number is way too muc
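One rough alternative is to size partitions by bytes rather than by core count - a minimal sketch, assuming an RDD that is already cached (getRDDStorageInfo only reports cached RDDs) and a hypothetical target of about 128 MB per partition:

    // estimate the cached size and derive a partition count from it (target size is an assumption)
    val estimatedBytes = rdd.context.getRDDStorageInfo
      .find(_.id == rdd.id).map(_.memSize).getOrElse(0L)
    val targetPartitionBytes = 128L * 1024 * 1024
    val numPartitions = math.max(1, (estimatedBytes / targetPartitionBytes).toInt)
    val repartitioned = rdd.repartition(numPartitions)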

R: ML PipelineModel to be scored locally

2016-07-20 Thread Simone
Thanks for your reply. I cannot rely on jpmml due to licensing issues. I can evaluate writing my own prediction code, but I am looking for a more general purpose approach. Any other thoughts? Best Simone - Original message - From: "Peyman Mohajerian" Sent: 20/07/2016 21:55 To: "Sim

Ratings in mllib.recommendation

2016-07-20 Thread glen

calculate time difference between consecutive rows

2016-07-20 Thread Divya Gehlot
I have a dataset of time as shown below: Time1 07:30:23 07:34:34 07:38:23 07:39:12 07:45:20 I need to find the diff between two consecutive rows. I googled and found that the *lag* function in *spark* helps in finding it, but it's giving me *null* in the result set. Would really appreciate the help.
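For reference, lag generally needs an explicit window ordering - a minimal sketch, assuming Spark 1.6 with a HiveContext (window functions require it in 1.6) and a hypothetical DataFrame df with a string column time1:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // no partitionBy here, so all rows land in one partition - fine for small data
    val w = Window.orderBy("time1")
    val withDiff = df
      .withColumn("secs", unix_timestamp(col("time1"), "HH:mm:ss"))
      .withColumn("diff_secs", col("secs") - lag(col("secs"), 1).over(w))
    withDiff.show()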

Re: Understanding spark concepts cluster, master, slave, job, stage, worker, executor, task

2016-07-20 Thread Sachin Mittal
Hi, Thanks for the links. Is there any English translation for the same? Sachin On Thu, Jul 21, 2016 at 8:34 AM, Taotao.Li wrote: > Hi, Sachin, here are two posts about the basic concepts about spark: > > >- spark-questions-concepts >

Re: the spark job is so slow - almost frozen

2016-07-20 Thread Zhiliang Zhu
Thanks a lot for your kind help. On Wednesday, July 20, 2016 11:35 AM, Andrew Ehrlich wrote: Try: - filtering down the data as soon as possible in the job, dropping columns you don't need - processing fewer partitions of the hive tables at a time - caching frequently accessed data, fo

Re: run spark apps in linux crontab

2016-07-20 Thread Mich Talebzadeh
You should source the environment file before, or in, the script. For example this one is ksh type: 0,5,10,15,20,25,30,35,40,45,50,55 * * * * (/home/hduser/dba/bin/send_messages_to_Kafka.ksh > /var/tmp/send_messages_to_Kafka.err 2>&1) In that shell it sources the environment file # # Main Section # E

Re: write and call UDF in spark dataframe

2016-07-20 Thread Mich Talebzadeh
something similar def ChangeToDate (word : String) : Date = { //return TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(word,"dd/MM/yyyy"),"yyyy-MM-dd")) val d1 = Date.valueOf(ReverseDate(word)) return d1 } sqlContext.udf.register("ChangeToDate", ChangeToDate(_:String)) Dr Mich Talebzadeh LinkedIn *

getting null when calculating time diff with unix_timestamp + spark 1.6

2016-07-20 Thread Divya Gehlot
Hi, val lags=sqlContext.sql("select *,(unix_timestamp(time1,'$timeFmt') - lag(unix_timestamp(time2,'$timeFmt'))) as time_diff from df_table"); Instead of the time difference in seconds I am getting null. Would really appreciate the help. Thanks, Divya
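Two things worth checking here (both are assumptions based on the snippet): without the s string interpolator, '$timeFmt' is passed literally as the format string, and in Spark 1.6 lag needs an OVER clause with an ordering. A minimal sketch of a corrected query, assuming a HiveContext and a hypothetical ordering on time1:

    val timeFmt = "HH:mm:ss"
    val lags = sqlContext.sql(
      s"""select *,
         |  unix_timestamp(time1, '$timeFmt')
         |    - lag(unix_timestamp(time2, '$timeFmt')) over (order by time1) as time_diff
         |from df_table""".stripMargin)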

Re: run spark apps in linux crontab

2016-07-20 Thread Chanh Le
you should use command.sh | tee file.log > On Jul 21, 2016, at 10:36 AM, > wrote: > > > thank you focus, and all. > this problem was solved by adding a line ". /etc/profile" in my shell. > > > > > Thanks & Best regards! > San.Luo > > - Original Message - > From:

Reply: Re: run spark apps in linux crontab

2016-07-20 Thread luohui20001
thank you focus, and all. This problem was solved by adding a line ". /etc/profile" in my shell. Thanks & Best regards! San.Luo - Original Message - From: "focus" To: "luohui20001" , "user@spark.apache.org" Subject: Re: run spark apps in linux crontab Date: 2016-07-20 18:11

Re: XLConnect in SparkR

2016-07-20 Thread Felix Cheung
From looking at the XLConnect package, its loadWorkbook() function only supports reading from a local file path, so you might need a way to call an HDFS command to get the file from HDFS first. SparkR currently does not support this - you could read it in as a text file (I don't think .xlsx is a t

Re: Understanding spark concepts cluster, master, slave, job, stage, worker, executor, task

2016-07-20 Thread Taotao.Li
Hi, Sachin, here are two posts about the basic concepts of spark: - spark-questions-concepts - deep-into-spark-exection-model And, I fully recommend da

Re: write and call UDF in spark dataframe

2016-07-20 Thread Divya Gehlot
Hi, To be very specific, I am looking for the UDF syntax, for example one which takes a String as parameter and returns an integer - how do we define the return type? Thanks, Divya On 21 July 2016 at 00:24, Andy Davidson wrote: > Hi Divya > > In general you will get better performance if you can minimiz
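For what it's worth, with sqlContext.udf.register the return type is inferred from the Scala function's signature - a minimal sketch, assuming Spark 1.6 and a hypothetical UDF that parses a String into an Int:

    // String => Int; the Int return type comes from the function signature itself
    val toIntSafe = (s: String) => scala.util.Try(s.trim.toInt).getOrElse(0)
    sqlContext.udf.register("toIntSafe", toIntSafe)
    sqlContext.sql("select toIntSafe('42') as n").show()

    // DataFrame DSL variant
    import org.apache.spark.sql.functions.udf
    val toIntSafeUdf = udf(toIntSafe)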

Re: Role-based S3 access outside of EMR

2016-07-20 Thread Everett Anderson
Thanks, Andy. I am indeed often doing something similar, now -- copying data locally rather than dealing with the S3 impl selection and AWS credentials issues. It'd be nice if it worked a little easier out of the box, though! On Tue, Jul 19, 2016 at 2:47 PM, Andy Davidson < a...@santacruzintegra

Re: Subquery in having-clause (Spark 1.1.0)

2016-07-20 Thread rickn
Seeing the same results on the current 1.6.2 release ... just wanted to confirm. Are there any workarounds? Do I need to wait for 2.0 for support? https://issues.apache.org/jira/browse/SPARK-12543 Thank you -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble

Re: HiveThriftServer and spark.sql.hive.thriftServer.singleSession setting

2016-07-20 Thread Chang Lim
It's an issue with the preview build. Switched to RC5 and all is working. Thanks to Michael Armbrust. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/HiveThriftServer-and-spark-sql-hive-thriftServer-singleSession-setting-tp27340p27379.html Sent from the Apa

Re: MultiThreading in Spark 1.6.0

2016-07-20 Thread Maciej Bryński
RK Aduri, Another idea is to union all results and then run collect. The question is how big the collected data is. 2016-07-20 20:32 GMT+02:00 RK Aduri : > Spark version: 1.6.0 > So, here is the background: > > I have a data frame (Large_Row_DataFrame) which I have created from an > array of r
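A minimal sketch of that idea, assuming Spark 1.6 and hypothetical per-thread result DataFrames that share the same schema:

    // union all partial results into one DataFrame and collect once
    val results: Seq[org.apache.spark.sql.DataFrame] = Seq(df1, df2, df3)  // placeholders
    val combined = results.reduce(_ unionAll _)
    val rows = combined.collect()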

Re: PySpark 2.0 Structured Streaming Question

2016-07-20 Thread Tathagata Das
foreachWriter is not currently available in Python. We don't have a clear plan yet on when foreachWriter will be available in Python. On Wed, Jul 20, 2016 at 1:22 PM, A.W. Covert III wrote: > Hi All, > > I've been digging into spark 2.0, I have some streaming jobs running well > on YARN, and

Re: Little idea needed

2016-07-20 Thread Aakash Basu
Thanks for the detailed description, buddy. But this will actually be done through NiFi (end to end), so we need to add the delta logic inside NiFi to automate the whole process. That's why we need a good (best) solution to solve this problem, since this is a classic issue which we can face in any compa

Re: Little idea needed

2016-07-20 Thread Aakash Basu
Your second point: that's going to be a bottleneck for all the programs which will fetch the data from that folder and again add extra filters into the DF. I want to finish that off there itself. And that merge logic is weak when one table is huge and the other is very small (which is the case he

PySpark 2.0 Structured Streaming Question

2016-07-20 Thread A.W. Covert III
Hi All, I've been digging into spark 2.0, I have some streaming jobs running well on YARN, and I'm working on some Spark Structured Streaming jobs now. I have a couple of jobs I'd like to move to Structured Streaming with the `foreachWriter` but it's not available in PySpark yet. Is it just becau

SparkWebUI and Master URL on EC2

2016-07-20 Thread KhajaAsmath Mohammed
Hi, I got access to a spark cluster and have instantiated spark-shell on AWS using the command $spark-shell. The Spark shell starts successfully, but I am looking to access the WebUI and Master URL. Does anyone know how to access these in AWS? I tried http://IPMaster:4040 and http://IpMaster:8080 but it d

Re: ML PipelineModel to be scored locally

2016-07-20 Thread Peyman Mohajerian
One option is to save the model in parquet or json format and then build your own prediction code. Some also use: https://github.com/jpmml/jpmml-sparkml It depends on the model, e.g. ml vs mllib, and other factors whether this works or not. A couple of weeks ago there was a long discussion on this

Using multiple data sources in one stream

2016-07-20 Thread Joe Panciera
Hi, I have a rather complicated situation that's raised an issue regarding consuming multiple data sources for processing. Unlike the use cases I've found, I have 3 sources of different formats. There's one 'main' stream A that does the processing, and 2 sources B and C that provide elements requir

Re: HiveThriftServer and spark.sql.hive.thriftServer.singleSession setting

2016-07-20 Thread Chang Lim
Would appreciate if someone: 1. Can confirm if this is an issue or 2. Share on how to get HiveThriftServer2.startWithContext working with shared temp table. I am using Beeline as the JDBC client to access the temp tables of the running Spark app. -- View this message in context: http://apache

MultiThreading in Spark 1.6.0

2016-07-20 Thread RK Aduri
Spark version: 1.6.0 So, here is the background: I have a data frame (Large_Row_DataFrame) which I have created from an array of row objects and also have another array of unique ids (U_ID) which I’m going to use to look up into the Large_Row_DataFrame (which is cached) to do a customized

Re: Saving a pyspark.ml.feature.PCA model

2016-07-20 Thread Ajinkya Kale
Just found Google dataproc has a preview of spark 2.0. Tried it and save/load works! Thanks Shuai. Followup question - is there a way to export the pyspark.ml models to PMML ? If not, what is the best way to integrate the model for inference in a production service ? On Tue, Jul 19, 2016 at 8:22 P

Re: Building standalone spark application via sbt

2016-07-20 Thread Sachin Mittal
I got the error during run time. It was for mongo-spark-connector class files. My build.sbt is like this name := "Test Advice Project" version := "1.0" scalaVersion := "2.10.6" libraryDependencies ++= Seq( "org.mongodb.spark" %% "mongo-spark-connector" % "1.0.0", "org.apache.spark" %% "spa

Re: Building standalone spark application via sbt

2016-07-20 Thread Marco Mistroni
That will work, but ideally you should not include any of the spark-related jars, as they are provided to you by the spark environment whenever you launch your app via spark-submit (this will prevent unexpected errors, e.g. when you kick off your app using a different version of spark where some of
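A minimal build.sbt sketch of that, assuming a fat jar is built with sbt-assembly and that the versions shown are placeholders:

    // Spark artifacts marked "provided" so they are not bundled into the fat jar
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.6.2" % "provided",
      "org.apache.spark" %% "spark-sql"  % "1.6.2" % "provided",
      "org.mongodb.spark" %% "mongo-spark-connector" % "1.0.0"  // shipped with the app
    )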

Attribute name "sum(proceeds)" contains invalid character(s) among " ,;{}()\n\t="

2016-07-20 Thread Chanh Le
Hi everybody, I got an error because a column name does not follow the naming rules. Please tell me how to fix it. Here is my code; metricFields is a Seq of metrics: spent, proceed, click, impression sqlContext .sql(s"select * from hourly where time between '$dateStr-00' and '$dateStr
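Since the subject mentions a generated name like sum(proceeds), one workaround is to alias aggregates or sanitize column names before writing - a minimal sketch, with hypothetical DataFrame and column names:

    import org.apache.spark.sql.functions._

    // alias aggregates up front...
    val agg = df.groupBy("time").agg(sum("proceeds").as("sum_proceeds"))

    // ...or sanitize whatever names are already there
    val cleaned = agg.columns.foldLeft(agg) { (d, c) =>
      d.withColumnRenamed(c, c.replaceAll("[ ,;{}()\\n\\t=]", "_"))
    }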

Re: Understanding spark concepts cluster, master, slave, job, stage, worker, executor, task

2016-07-20 Thread Jean Georges Perrin
Hey, I love when questions are numbered, it's easier :) 1) Yes (but I am not an expert) 2) You don't control... One of my processes goes to 8k tasks, so... 3) Yes, if you have HT, it doubles. My servers have 12 cores, but HT, so it makes 24. 4) From my understanding: Slave is the logical comput

Re: OutOfMemory when doing joins in spark 2.0 while same code runs fine in spark 1.5.2

2016-07-20 Thread Ian O'Connell
Ravi, did your issue ever get solved? I think I've been hitting the same thing; it looks like the spark.sql.autoBroadcastJoinThreshold stuff isn't kicking in as expected. If I set that to -1 then the computation proceeds successfully. On Tue, Jun 14, 2016 at 12:28 AM, Ravi Aggarwal wrote

Understanding spark concepts cluster, master, slave, job, stage, worker, executor, task

2016-07-20 Thread Sachin Mittal
Hi, I was able to build and run my spark application via spark-submit. I have understood some of the concepts by going through the resources at https://spark.apache.org but a few doubts still remain. I have a few specific questions and would be glad if someone could shed some light on them. So I submi

Re: RandomForestClassifier

2016-07-20 Thread Marco Mistroni
Hi, afaik yes (others please override). Generally, in RandomForest and DecisionTree you have a column which you are trying to 'predict' (the label) and a set of features that are used to predict the outcome. I would assume that if you specify the label column and the 'features' column, everything else

Re: Building standalone spark application via sbt

2016-07-20 Thread Sachin Mittal
The NoClassDefFound error was for spark classes like, say, SparkContext. When running a standalone spark application I was not passing external jars using the --jars option. However I have fixed this by making a fat jar using the sbt assembly plugin. Now all the dependencies are included in that jar and I use t
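For reference, a minimal sbt-assembly setup looks roughly like this (the plugin version is an assumption; check the sbt-assembly project for the one matching your sbt):

    // project/plugins.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

    // then build the fat jar with:   sbt assembly
    // and submit it with something like:
    //   spark-submit --class MyMainClass target/scala-2.10/myapp-assembly-1.0.jar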

RandomForestClassifier

2016-07-20 Thread pseudo oduesp
Hi, we have parameters named labelCol="label" and featuresCol="features". When I specify the values here (label and features) and train my model on a data frame with other columns, does the algorithm choose only the label column and the features column? Thanks

Re: Spark driver getting out of memory

2016-07-20 Thread RK Aduri
Cache defaults to MEMORY_ONLY. Can you try with different storage levels, i.e., MEMORY_ONLY_SER or even DISK_ONLY? You may want to use persist() instead of cache. Or there is an experimental storage level OFF_HEAP which might also help. On Tue, Jul 19, 2016 at 11:08 PM, Saurav Sinha wrote: > H
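A minimal sketch of that, assuming a hypothetical DataFrame df:

    import org.apache.spark.storage.StorageLevel

    // serialized storage trades some CPU for a smaller memory footprint
    df.persist(StorageLevel.MEMORY_ONLY_SER)
    // or spill to disk when memory is tight
    // df.persist(StorageLevel.MEMORY_AND_DISK_SER)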

Storm HDFS bolt equivalent in Spark Streaming.

2016-07-20 Thread Rajesh_Kalluri
While writing to HDFS from Storm, the hdfs bolt provides a nice way to batch the messages, rotate files, set a file name convention etc. as shown below. Do you know of something similar in Spark Streaming, or do we have to roll our own? If anyone has attempted this, can y

How to connect HBase and Spark using Python?

2016-07-20 Thread Def_Os
I'd like to know whether there's any way to query HBase with Spark SQL via the PySpark interface. See my question on SO: http://stackoverflow.com/questions/38470114/how-to-connect-hbase-and-spark-using-python The new HBase-Spark module in HBase, which introduces the HBaseContext/JavaHBaseContext,

Re: write and call UDF in spark dataframe

2016-07-20 Thread Andy Davidson
Hi Divya, In general you will get better performance if you can minimize your use of UDFs. Spark 2.0/Tungsten does a lot of code generation. It will have to treat your UDF as a black box. Andy From: Rishabh Bhardwaj Date: Wednesday, July 20, 2016 at 4:22 AM To: Rabin Banerjee Cc: Divya Geh

Re: Building standalone spark application via sbt

2016-07-20 Thread Marco Mistroni
Hello Sachin, please paste the NoClassDefFound exception so we can see what's failing; also please advise how you are running your Spark app. For an extremely simple case, let's assume you have your MyFirstSparkApp packaged in your myFirstSparkApp.jar. Then all you need to do would be to kick off

difference between two consecutive rows of same column + spark + dataframe

2016-07-20 Thread Divya Gehlot
Hi, I have a dataset of time as shown below: Time1 07:30:23 07:34:34 07:38:23 07:39:12 07:45:20 I need to find the diff between two consecutive rows. I googled and found that the *lag* function in *spark* helps in finding it, but it's giving me *null* in the result set. Would really appreciate th

Best practices to restart Spark jobs programatically from driver itself

2016-07-20 Thread unk1102
Hi, I have multiple long-running spark jobs which many times hang because of the multi-tenant Hadoop cluster and resource scarcity. I am thinking of restarting spark jobs within the driver itself. For e.g. if a spark job does not write output files for, say, 30 minutes then I want to restart the spark job by itself

Re: spark worker continuously trying to connect to master and failed in standalone mode

2016-07-20 Thread Igor Berman
In addition, check what IP the master is binding to (with netstat). On 20 July 2016 at 06:12, Andrew Ehrlich wrote: > Troubleshooting steps: > > $ telnet localhost 7077 (on master, to confirm port is open) > $ telnet 7077 (on slave, to confirm port is blocked) > > If the port is available on the ma

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-20 Thread Yu Wei
This is a startup project. We don't know how much data will be written every day. Definitely, there is not too much data at the beginning, but the data will increase later. And we want to use spark streaming to receive data via the MQTT util. We're now evaluating which components could be used for storing d

Re: Building standalone spark application via sbt

2016-07-20 Thread Mich Talebzadeh
You need an uber jar file. Have you actually followed the dependencies and project sub-directory build? Check this: http://stackoverflow.com/questions/28459333/how-to-build-an-uber-jar-fat-jar-using-sbt-within-intellij-idea - of the three answers, the top one. I started reading the official SBT tu

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-20 Thread Ted Yu
You can decide which component(s) to use for storing your data. If you haven't used hbase before, it may be better to store data on hdfs and query through Hive or SparkSQL. Maintaining hbase is not a trivial task, especially when the cluster size is large. How much data are you expecting to be writ

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-20 Thread Yu Wei
I'm a beginner in big data. I don't have much knowledge about hbase/hive. What's the difference between hbase and hive/hdfs for storing data for analytics? Thanks, Jared From: ayan guha Sent: Wednesday, July 20, 2016 9:34:24 PM To: Rabin Banerjee Cc: user;

Re: Latest 200 messages per topic

2016-07-20 Thread Cody Koeninger
If they're files in a file system, and you don't actually need multiple kinds of consumers, have you considered streamingContext.fileStream instead of kafka? On Wed, Jul 20, 2016 at 5:40 AM, Rabin Banerjee wrote: > Hi Cody, > > Thanks for your reply . > >Let Me elaborate a bit,We have a D

ML PipelineModel to be scored locally

2016-07-20 Thread Simone Miraglia
Hi all, I am working on the following use case involving ML Pipelines. 1. I created a Pipeline composed of a set of stages 2. I called the "fit" method on my training set 3. I validated my model by calling "transform" on my test set 4. I stored my fitted Pipeline to a shared folder Then I have a v

Snappy initialization issue, spark assembly jar missing snappy classes?

2016-07-20 Thread Eugene Morozov
Greetings! We're reading input files with newApiHadoopFile that is configured with multiline split. Everything's fine, besides https://issues.apache.org/jira/browse/MAPREDUCE-6549. It looks like the issue is fixed, but within hadoop 2.7.2. Which means we have to download spark without hadoop and p

Re: Spark Job trigger in production

2016-07-20 Thread Sathish Kumaran Vairavelu
If you are using Mesos, then you can use Chronos or Marathon. On Wed, Jul 20, 2016 at 6:08 AM Rabin Banerjee wrote: > ++ crontab :) > > On Wed, Jul 20, 2016 at 9:07 AM, Andrew Ehrlich > wrote: > >> Another option is Oozie with the spark action: >> https://oozie.apache.org/docs/4.2.0/DG_SparkActionE

Spark 1.6.2 Spark-SQL RACK_LOCAL

2016-07-20 Thread chandana
Hive - 1.2.1 AWS EMR 4.7.2 I have external tables with partitions from S3. I had good performance with Spark 1.6.1, with NODE_LOCAL data about 7x faster than RACK_LOCAL data. With Spark 1.6.2 and AWS EMR 4.7.2, my node locality is 0! Rack locality 100%. I am using the default settings and didn't c

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-20 Thread ayan guha
Just as a sanity check, saving data to hbase for analytics may not be the best choice. Any specific reason for not using hdfs or hive? On 20 Jul 2016 20:57, "Rabin Banerjee" wrote: > Hi Wei , > > You can do something like this , > > foreachPartition( (part) => {val conn = > ConnectionFactory.c

lift coefficient

2016-07-20 Thread pseudo oduesp
Hi, how can we calculate the lift coefficient from the PySpark prediction results? Thanks

Re: Little idea needed

2016-07-20 Thread Mich Talebzadeh
In reality, true real-time analytics will require interrogating the transaction (redo) log of the RDBMS database to see changes. An RDBMS will only keep one current record (the most recent), so if a record has been deleted since the last import into HDFS, that record will not exist. If the record has been

Re: write and call UDF in spark dataframe

2016-07-20 Thread Mich Talebzadeh
yep, something along the lines of val df = sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') as time ") Note that this does not require a column from an already existing table. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianr

Re: write and call UDF in spark dataframe

2016-07-20 Thread Rishabh Bhardwaj
Hi Divya, There is already a "from_unixtime" function in org.apache.spark.sql.functions; Rabin has used that in the sql query. If you want to use it in the dataframe DSL you can try it like this: val new_df = df.select(from_unixtime($"time").as("newtime")) Thanks, Rishabh. On Wed, Jul 20, 2016 at 4:21 PM

Re: Spark Job trigger in production

2016-07-20 Thread Rabin Banerjee
++ crontab :) On Wed, Jul 20, 2016 at 9:07 AM, Andrew Ehrlich wrote: > Another option is Oozie with the spark action: > https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html > > On Jul 18, 2016, at 12:15 AM, Jagat Singh wrote: > > You can use following options > > * spark-submit from

Re: Storm HDFS bolt equivalent in Spark Streaming.

2016-07-20 Thread Rabin Banerjee
++Deepak, There is also an option to use saveAsHadoopFile & saveAsNewAPIHadoopFile, in which you can customize (filename and many things ...) the way you want to save it. :) Happy Sparking Regards, Rabin Banerjee On Wed, Jul 20, 2016 at 10:01 AM, Deepak Sharma wrote: > In spark streaming ,

Re: Running multiple Spark Jobs on Yarn( Client mode)

2016-07-20 Thread Rabin Banerjee
Hi Vaibhav, Please check your yarn configuration and make sure you have available resources. Please try creating multiple queues, and submit jobs to the queues: --queue thequeue Regards, Rabin Banerjee On Wed, Jul 20, 2016 at 12:05 PM, vaibhavrtk wrote: > I have a silly question: > > Do multiple sp

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-20 Thread Rabin Banerjee
Hi Wei , You can do something like this , foreachPartition( (part) => {val conn = ConnectionFactory.createConnection(HBaseConfiguration.create()); val table = conn.getTable(TableName.valueOf(tablename)); //part.foreach((inp)=>{println(inp);table.put(inp)}) //This is line by line putta
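A slightly expanded sketch of that pattern, assuming the HBase 1.x client API, a hypothetical table name, and a hypothetical RDD of (rowKey, value) string pairs (batching the puts per partition rather than writing row by row):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes
    import scala.collection.JavaConverters._

    rdd.foreachPartition { part =>
      // one connection per partition, not per record
      val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val table = conn.getTable(TableName.valueOf("results"))   // table name is a placeholder
      try {
        val puts = part.map { case (rowKey: String, value: String) =>
          val p = new Put(Bytes.toBytes(rowKey))
          p.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
          p
        }.toList
        table.put(puts.asJava)   // batched put for the whole partition
      } finally {
        table.close()
        conn.close()
      }
    }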

Re: write and call UDF in spark dataframe

2016-07-20 Thread Rabin Banerjee
Hi Divya, Try: val df = sqlContext.sql("select from_unixtime(ts,'yyyy-MM-dd') as `ts` from mr") Regards, Rabin On Wed, Jul 20, 2016 at 12:44 PM, Divya Gehlot wrote: > Hi, > Could somebody share an example of writing and calling a udf which converts > a unix timestamp to date time. > > > Thanks, >

Re: XLConnect in SparkR

2016-07-20 Thread Rabin Banerjee
Hi Yogesh, I have never tried reading XLS files using Spark. But I think you can use sc.wholeTextFiles to read the complete xls at once; as xls files are xml internally, you need to read them whole to parse. Then I think you can use Apache POI to read them. Also, you can copy your XLS data t

Re: run spark apps in linux crontab

2016-07-20 Thread Rabin Banerjee
Hi, please check your deploy mode and master. For example, if you want to deploy in yarn cluster mode you should use --master yarn-cluster; if you want to do it in yarn client mode you should use --master yarn-client. Please note that for your case deploying yarn-cluster will be better, as cluster mode

Re: Latest 200 messages per topic

2016-07-20 Thread Rabin Banerjee
Hi Cody, Thanks for your reply. Let me elaborate a bit: we have a directory where small xml (90 KB) files are continuously arriving (pushed from another node). Each file has an ID & timestamp in its name and also inside the record. Data coming into the directory has to be pushed to Kafka to finally get into Spar

Re:run spark apps in linux crontab

2016-07-20 Thread focus
Hi, I just met this problem, too! The reason is that the crontab runtime doesn't have the variables you defined, such as $SPARK_HOME. I defined SPARK_HOME and other variables in /etc/profile like this: export MYSCRIPTS=/opt/myscripts export SPARK_HOME=/opt/spark then, in my crontab job script d

RE: run spark apps in linux crontab

2016-07-20 Thread Joaquin Alzola
Remember that you need to source your .bashrc for your PATH to be set up. From: luohui20...@sina.com [mailto:luohui20...@sina.com] Sent: 20 July 2016 11:01 To: user Subject: run spark apps in linux crontab hi guys: I add a spark-submit job into my Linux crontab list by the means below

run spark apps in linux crontab

2016-07-20 Thread luohui20001
hi guys: I added a spark-submit job to my Linux crontab by the means below; however, none of them works. If I change it to a normal shell script, it is ok. I don't quite understand why. I checked the 8080 web ui of my spark cluster: no job submitted, and there are no messages in /home/h

Re: Building standalone spark application via sbt

2016-07-20 Thread Sachin Mittal
Hi, I am following the example under https://spark.apache.org/docs/latest/quick-start.html For standalone scala application. I added all my dependencies via build.sbt (one dependency is under lib folder). When I run sbt package I see the jar created under target/scala-2.10/ So compile seems to b

XLConnect in SparkR

2016-07-20 Thread Yogesh Vyas
Hi, I am trying to load and read excel sheets from HDFS in sparkR using XLConnect package. Can anyone help me in finding out how to read xls files from HDFS in sparkR ? Regards, Yogesh

How spark decides whether to do BroadcastHashJoin or SortMergeJoin

2016-07-20 Thread raaggarw
Hi, How does Spark decide/optimize internally when it needs to do a BroadcastHashJoin vs a SortMergeJoin? Is there any way we can guide it from outside, or through options, as to which join to use? Because in my case, when I am trying to do a join, Spark makes that join a BroadcastHashJoin internally, and when j
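For what it's worth, two knobs commonly influence this choice - a minimal sketch, assuming Spark 1.6 and hypothetical DataFrames large and small:

    // size (in bytes) below which a table may be broadcast; -1 disables broadcast joins
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

    // or hint a broadcast explicitly for one join
    import org.apache.spark.sql.functions.broadcast
    val joined = large.join(broadcast(small), Seq("id"))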

write and call UDF in spark dataframe

2016-07-20 Thread Divya Gehlot
Hi, Could somebody share an example of writing and calling a udf which converts a unix timestamp to date time. Thanks, Divya