RE: Not able to write output to local filesystem from Standalone mode.

2016-05-24 Thread Stuti Awasthi
Thanks Mathieu. So I must have either a shared filesystem or Hadoop as the filesystem in order to write data from a Standalone mode cluster setup. Thanks for your input. Regards Stuti Awasthi From: Mathieu Longtin [math...@closetwork.org] Sent: Tuesday, May 24, 2016 7:34 PM To: Stuti

Using Java in Spark shell

2016-05-24 Thread Ashok Kumar
Hello, A newbie question: is it possible to use Java code directly in the Spark shell without using Maven to build a jar file? How can I switch from Scala to Java in the Spark shell? Thanks
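
For context: spark-shell is a Scala REPL, so Java source cannot be typed into it directly; however, Java classes already on the classpath can be invoked from Scala without building a jar. A minimal sketch:

    // inside spark-shell: plain Java classes are callable from the Scala prompt
    val list = new java.util.ArrayList[String]()
    list.add("hello from a Java collection")
    sc.parallelize(Seq(1, 2, 3)).map(_ * 2).collect()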

job build cost more and more time

2016-05-24 Thread naliazheli
I am using Spark 1.6 and noticed the time between jobs gets longer; sometimes it can be 20 minutes. I tried to search for similar questions and found a close one: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-app-gets-slower-as-it-gets-executed-more-times-td1089.html#a1146 and found something

Re: Dataset Set Operations

2016-05-24 Thread Michael Armbrust
What is the schema of the case class? On Tue, May 24, 2016 at 3:46 PM, Tim Gautier wrote: > Hello All, > > I've been trying to subtract one dataset from another. Both datasets > contain case classes of the same type. When I subtract B from A, I end up > with a copy of A

Dataset Set Operations

2016-05-24 Thread Tim Gautier
Hello All, I've been trying to subtract one dataset from another. Both datasets contain case classes of the same type. When I subtract B from A, I end up with a copy of A that still has the records of B in it. (An intersection of A and B always results in 0 results.) All I can figure is that
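
One workaround sketch, assuming Spark 1.6 and a hypothetical case class Rec: drop to the DataFrame API, whose except compares rows by content, then re-encode the result as the case class.

    case class Rec(id: Int, name: String)
    import sqlContext.implicits._

    val a = Seq(Rec(1, "x"), Rec(2, "y")).toDS()
    val b = Seq(Rec(2, "y")).toDS()

    // DataFrame.except compares by row contents; .as[Rec] re-encodes the result
    val diff = a.toDF().except(b.toDF()).as[Rec]
    diff.show()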

Re: How does Spark set task indexes?

2016-05-24 Thread Ted Yu
Have you taken a look at SPARK-14915? On Tue, May 24, 2016 at 1:00 PM, Adrien Mogenet < adrien.moge...@contentsquare.com> wrote: > Hi, > > I'm wondering how Spark sets the "index" of a task? > I'm asking this question because we have a job that constantly fails at > task index = 421. > >

Error while saving plots

2016-05-24 Thread njoshi
For an analysis app, I have to make ROC curves on the fly and save them to disk. I am using scala-chart for this purpose and doing the following in my Spark app: val rocs = performances.map{case (id, (auRoc, roc)) => (id, roc.collect().toList)} XYLineChart(rocs.toSeq, title = "Pooled Data

Re: Spark-submit hangs indefinitely after job completion.

2016-05-24 Thread Pradeep Nayak
BTW, I am using a 6-node cluster with m4.2xlarge machines on Amazon. I have tried both yarn-cluster and Spark's native cluster mode as well. On Tue, May 24, 2016 at 12:10 PM Mathieu Longtin wrote: > I have been seeing the same behavior in standalone with a master. >

How does Spark set task indexes?

2016-05-24 Thread Adrien Mogenet
Hi, I'm wondering how Spark sets the "index" of a task? I'm asking this question because we have a job that constantly fails at task index = 421. When increasing the number of partitions, it then fails at index = 4421. Increase it a little bit more, and now it's 24421. Our job is as simple as "(1)

Error publishing to spark-packages

2016-05-24 Thread Neville Li
Hi guys, I built a Spark package but couldn't publish it with the sbt-spark-package plugin. Any idea why these are failing? http://spark-packages.org/staging?id=1179 http://spark-packages.org/staging?id=1168 Repo: https://github.com/spotify/spark-bigquery Jars are published to Maven:

Re: Maintain kafka offset externally as Spark streaming processes records.

2016-05-24 Thread Cody Koeninger
Have you looked at everything linked from https://github.com/koeninger/kafka-exactly-once On Tue, May 24, 2016 at 2:07 PM, sagarcasual . wrote: > In spark streaming consuming kafka using KafkaUtils.createDirectStream, > there are examples of the kafka offset level

Re: Spark-submit hangs indefinitely after job completion.

2016-05-24 Thread Mathieu Longtin
I have been seeing the same behavior in standalone with a master. On Tue, May 24, 2016 at 3:08 PM Pradeep Nayak wrote: > > > I have posted the same question on Stack Overflow: >

Spark-submit hangs indefinitely after job completion.

2016-05-24 Thread Pradeep Nayak
I have posted the same question on Stack Overflow: http://stackoverflow.com/questions/37421852/spark-submit-continues-to-hang-after-job-completion I am trying to test Spark 1.6 with HDFS in AWS. I am using the wordcount Python example available in the examples folder. I submit the job with

Maintain kafka offset externally as Spark streaming processes records.

2016-05-24 Thread sagarcasual .
In Spark Streaming, consuming Kafka using KafkaUtils.createDirectStream, there are examples of the Kafka offset-level ranges. However: 1. I would like to periodically maintain the offset level so that if needed I can reprocess items from an offset. Is there any way I can retrieve the offset of a message in
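
For reference, the direct stream exposes each batch's per-partition offsets through HasOffsetRanges; a minimal sketch of capturing them (the stream name and the external-store call are hypothetical placeholders):

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    directStream.foreachRDD { rdd =>
      // the cast is valid only on the RDD produced directly by createDirectStream
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsetRanges.foreach { o =>
        saveOffset(o.topic, o.partition, o.untilOffset)  // hypothetical helper
      }
      // ... process rdd ...
    }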

Re: How to read *.jhist file in Spark using scala

2016-05-24 Thread Miles
Instead of reading *.jhist files directly in Spark, you could convert your .jhist files into JSON and then read the JSON files in Spark. Here's a post on converting .jhist files to JSON format: http://stackoverflow.com/questions/32683907/converting-jhist-files-to-json-format
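
Once converted, Spark 1.6 can read newline-delimited JSON directly; a minimal sketch with an illustrative path:

    // each input line must be one complete JSON object
    val histories = sqlContext.read.json("/path/to/converted/*.json")
    histories.printSchema()
    histories.registerTempTable("job_history")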

Re: spark streaming: issue with logging with separate log4j properties files for driver and executor

2016-05-24 Thread chandan prakash
Resolved. I passed the parameters in SparkConf instead of on the spark-submit command line (still don't know why passing them to spark-submit did not work): sparkConf.set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -Dlog4j.configuration=file:log4j_RequestLogExecutor.properties") On
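
A cleaned-up sketch of the approach described above; the driver-side property file name is an assumption, and both files must be resolvable where they are read (e.g. shipped via --files):

    val sparkConf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
        "-XX:+UseG1GC -Dlog4j.configuration=file:log4j_RequestLogExecutor.properties")
      // hypothetical: a separate log4j file for the driver; in client mode this may
      // need to be set before the driver JVM starts (e.g. via --driver-java-options)
      .set("spark.driver.extraJavaOptions",
        "-Dlog4j.configuration=file:log4j_RequestLogDriver.properties")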

Re: Spark Streaming with Kafka

2016-05-24 Thread Rasika Pohankar
Hi firemonk9, Sorry, it's been too long but I just saw this. I hope you were able to resolve it. FWIW, we were able to solve this with the help of the Low Level Kafka Consumer (instead of the inbuilt Kafka consumer in Spark), from here: https://github.com/dibbhatt/kafka-spark-consumer/.

Re: spark streaming: issue with logging with separate log4j properties files for driver and executor

2016-05-24 Thread chandan prakash
Any suggestion? On Mon, May 23, 2016 at 5:18 PM, chandan prakash wrote: > Hi, > I am able to do logging for the driver but not for the executor. > > I am running Spark Streaming under Mesos. > I want to do log4j logging separately for the driver and executor. > > Used the below

Possible bug involving Vectors with a single element

2016-05-24 Thread flyinggip
Hi there, I notice that there might be a bug in pyspark.mllib.linalg.Vectors when dealing with a vector with a single element. First, the 'dense' method says it can also take a numpy.array. However the code uses 'if len(elements) == 1' and when a numpy.array has only one element its length is

Re: Not able to write output to local filsystem from Standalone mode.

2016-05-24 Thread Mathieu Longtin
In standalone mode, executors assume they have access to a shared file system. The driver creates the directory and the executors write the files, so the executors end up not writing anything since the directory does not exist on their local filesystems. On Tue, May 24, 2016 at 8:01 AM Stuti Awasthi wrote:
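
In other words, actions like saveAsTextFile run on the executors, not the driver, so a local path only works if every node can see it. A hedged sketch of the two usual alternatives (paths are illustrative):

    // option 1: write to a filesystem all nodes share (NFS mount, HDFS, S3, ...)
    rdd.saveAsTextFile("hdfs://namenode:8020/user/stuti/test1")

    // option 2: for small results only, pull everything to the driver and write locally
    import java.io.PrintWriter
    val out = new PrintWriter("/home/stuti/test1.txt")
    try rdd.collect().foreach(out.println) finally out.close()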

Re: Hive_context

2016-05-24 Thread Ajay Chander
Hi Arun, Thanks for your time. I was able to connect through the JDBC Java client, but I am not able to connect from my Spark application. Do you think I missed any configuration step within the code? Somehow my application is not picking up hive-site.xml from my machine; I put it under the class path
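
For reference, a minimal sketch of the usual setup: hive-site.xml has to be on the application classpath (commonly $SPARK_HOME/conf) before the HiveContext is constructed; nothing about the metastore is configured in code.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("HiveApp"))
    // reads hive-site.xml from the classpath to find the metastore
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SHOW TABLES").show()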

Re: Using HiveContext.set in multipul threads

2016-05-24 Thread Silvio Fiorito
If you're using the DataFrame API you can achieve that by simply using (or not) the "partitionBy" method on the DataFrameWriter: val originalDf = ... val df1 = originalDf... val df2 = originalDf... df1.write.partitionBy("col1").save(...) df2.write.save(...) From: Amir Gershman Date:
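
Expanded into a fuller sketch (paths, column names, and the filter are illustrative):

    import org.apache.spark.sql.functions.col

    val df1 = originalDf.filter(col("col1").isNotNull)
    val df2 = originalDf.select("col2", "col3")

    // writes one subdirectory per distinct value of col1, e.g. .../col1=A/
    df1.write.partitionBy("col1").parquet("/data/out/partitioned")
    // plain write, no partition subdirectories
    df2.write.parquet("/data/out/flat")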

RE: Not able to write output to local filesystem from Standalone mode.

2016-05-24 Thread Stuti Awasthi
Hi Jacek, The parent directory is already present; it's my home directory. I'm using a 64-bit Linux (Red Hat) machine. Also I noticed that a "test1" folder is created on my master with a subdirectory "_temporary", which is empty, but on the slaves no such directory is created under /home/stuti. Thanks Stuti

Re: Not able to write output to local filesystem from Standalone mode.

2016-05-24 Thread Jacek Laskowski
Hi, What happens when you create the parent directory /home/stuti? I think the failure is due to missing parent directories. What's the OS? Jacek On 24 May 2016 11:27 a.m., "Stuti Awasthi" wrote: Hi All, I have a 3-node Spark 1.6 Standalone mode cluster with 1 master and

Re: How to run hive queries in async mode using spark sql

2016-05-24 Thread Mich Talebzadeh
Fine, give me an example where you have tried to turn on async for the query using Spark SQL: your actual code. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: How to run hive queries in async mode using spark sql

2016-05-24 Thread Raju Bairishetti
Hi Mich, Thanks for the response. Yes, I do not want to block until the Hive query is completed, and I want to know whether there is any way to poll the status/progress of a submitted query. I can turn on async mode for Hive queries in Spark SQL, but how do I track the status of the submitted query? On Tue,

Using HiveContext.set in multipul threads

2016-05-24 Thread Amir Gershman
Hi, I have a DataFrame I compute from a long chain of transformations. I cache it, and then perform two additional transformations on it. I use two Futures: each Future will insert the content of one of the above DataFrames into a different Hive table. One Future must SET

Re: How to run hive queries in async mode using spark sql

2016-05-24 Thread Mich Talebzadeh
By Hive queries in async mode, do you mean submitting SQL queries to Hive, moving on to the next operation, and waiting for the result set to return from Hive? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: How to run hive queries in async mode using spark sql

2016-05-24 Thread Raju Bairishetti
Any thoughts on this? In Hive, it returns an operation handle, which can be used for fetching the status of the query. Is there any similar mechanism in Spark SQL? It looks like almost all the methods in the HiveContext are either protected or private. On Wed, May 18, 2016 at 9:03 AM, Raju

Not able to write output to local filesystem from Standalone mode.

2016-05-24 Thread Stuti Awasthi
Hi All, I have a 3-node Spark 1.6 Standalone mode cluster with 1 master and 2 slaves. Also, I am not using Hadoop as the filesystem. I am able to launch the shell, read the input file from the local filesystem, and perform transformations successfully. When I try to write my output to a local filesystem path

Re: Spark JOIN Not working

2016-05-24 Thread Alonso Isidoro Roman
Could you share a bit of the dataset? It's difficult to test without data... Alonso Isidoro Roman about.me/alonso.isidoro.roman 2016-05-24 8:43 GMT+02:00 Aakash Basu

Re: Spark Streaming with Redis

2016-05-24 Thread Pariksheet Barapatre
Thanks Sachin. The link that you mentioned uses the native connection library (JedisPool). I am looking at whether I can use https://github.com/RedisLabs/spark-redis for the same functionality. Regards Pari On 24 May 2016 at 13:33, Sachin Aggarwal wrote: > Hi, > > yahoo benchmark uses

Re: Spark Streaming with Redis

2016-05-24 Thread Sachin Aggarwal
Hi, the Yahoo benchmark uses Redis with Spark; have a look at this: https://github.com/yahoo/streaming-benchmarks/blob/master/spark-benchmarks/src/main/scala/AdvertisingSpark.scala On Tue, May 24, 2016 at 1:28 PM, Pariksheet Barapatre wrote: > Hello All, > > I am trying to

Spark Streaming with Redis

2016-05-24 Thread Pariksheet Barapatre
Hello All, I am trying to use Redis as a data store for one of our sensor-data use cases, and I am fairly new to Redis. I guess the spark-redis module can help me, but I am not getting how to use the INCR or HINCRBY Redis functions. Could you please help me with some example code or any pointers to solve
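
Short of a spark-redis example, one common pattern is a plain Jedis client inside foreachPartition; a hedged sketch in which the host, key names, stream name, and record shape are all assumptions:

    import redis.clients.jedis.Jedis

    sensorStream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val jedis = new Jedis("localhost", 6379)  // assumed Redis endpoint
        records.foreach { case (sensorId: String, reading: Long) =>
          jedis.incr("events:total")                       // plain counter
          jedis.hincrBy("sensor:sums", sensorId, reading)  // per-sensor hash field
        }
        jedis.close()
      }
    }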

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-24 Thread Mich Talebzadeh
Hi, We use Hive as the database and Spark as an all-purpose query tool. Whether Hive is the right database for the purpose, or one is better off with something like Phoenix on HBase, well, the answer is it depends and your mileage varies. So, fit for purpose. Ideally what one wants is to use the

Spark JOIN Not working

2016-05-24 Thread Aakash Basu
Hi experts, I'm extremely new to the Spark ecosystem, so I need some help from you guys. While trying to fetch data from CSV files and join-querying them according to the need, when I'm caching the data by registering temp tables and then using a select query to select what I need as per the
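
For reference, the general shape of that pattern in Spark 1.6 with the spark-csv package (paths and column names are illustrative):

    val a = sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true").load("/data/a.csv")
    val b = sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true").load("/data/b.csv")

    a.cache().registerTempTable("a")
    b.cache().registerTempTable("b")

    // note: spark-csv loads every column as a string unless inferSchema is on,
    // and mismatched key types are a common cause of empty join results
    val joined = sqlContext.sql(
      "SELECT a.id, a.name, b.value FROM a JOIN b ON a.id = b.id")
    joined.show()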