Re: Is it OK to make I/O calls in a UDF? In other words, is it standard practice?

2018-04-23 Thread Sathish Kumaran Vairavelu
I have made a simple REST call within a UDF and it worked, but I am not sure
whether it scales to large datasets; it may be fine for small lookup files. Thanks
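
For illustration, a minimal sketch of that pattern (Scala; the endpoint URL and
column names are hypothetical, and a local session is assumed). One HTTP call is
made per row, which is why it only really suits small lookup sets:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("udf-io-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// One remote call per row: workable for a small lookup set, but with no batching
// or connection reuse it will not scale to large datasets.
val restLookup = udf { (id: String) =>
  scala.io.Source.fromURL(s"http://lookup.example.com/ids/$id").mkString  // hypothetical service
}

val df = Seq("a1", "a2").toDF("id")
df.withColumn("payload", restLookup($"id")).show(false)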
On Mon, Apr 23, 2018 at 4:28 PM kant kodali  wrote:

> Hi All,
>
> Is it ok to make I/O calls in UDF? other words is it a standard practice?
>
> Thanks!
>


Re: Spark querying C* in Scala

2018-01-22 Thread Sathish Kumaran Vairavelu
You have to register the Cassandra table in Spark as a DataFrame:


https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
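
For the question quoted below, a minimal sketch of that approach (the keyspace
name "ks" is hypothetical since the original query does not show one; it assumes
the spark-cassandra-connector package is on the classpath and uses Spark 2.2's
spark.read.json over a Dataset[String] to infer the schema of the JSON column):

val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("cassandra-json-schema")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .config("spark.cassandra.auth.username", "cassandra")
  .config("spark.cassandra.auth.password", "cassandra")
  .getOrCreate()
import spark.implicits._

// Register the Cassandra table as a DataFrame.
val cassDf = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "table1"))
  .load()

// Infer the schema of the JSON held in the "kafka" text column.
val jsonDf = spark.read.json(cassDf.select($"kafka").as[String])
jsonDf.printSchema()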


Thanks

Sathish
On Mon, Jan 22, 2018 at 7:43 AM Conconscious  wrote:

> Hi list,
>
> I have a Cassandra table with two fields; id bigint, kafka text
>
> My goal is to read only the kafka field (that is a JSON) and infer the
> schema
>
> I have this skeleton code (not working):
>
> sc.stop
> import org.apache.spark._
> import com.datastax.spark._
> import org.apache.spark.sql.functions.get_json_object
>
> import org.apache.spark.sql.functions.to_json
> import org.apache.spark.sql.functions.from_json
> import org.apache.spark.sql.types._
>
> val conf = new SparkConf(true)
> .set("spark.cassandra.connection.host", "127.0.0.1")
> .set("spark.cassandra.auth.username", "cassandra")
> .set("spark.cassandra.auth.password", "cassandra")
> val sc = new SparkContext(conf)
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val df = sqlContext.sql("SELECT kafka from table1")
> df.printSchema()
>
> I think I have at least two problems: the keyspace is missing, the table is
> not being recognized, and for sure it is not going to infer the schema from
> the text field.
>
> I have a working solution for json files, but I can't "translate" this
> to Cassandra:
>
> import org.apache.spark.sql.SparkSession
> import spark.implicits._
> val spark = SparkSession.builder().appName("Spark SQL basic
> example").getOrCreate()
> val redf = spark.read.json("/usr/local/spark/examples/cqlsh_r.json")
> redf.printSchema
> redf.count
> redf.show
> redf.createOrReplaceTempView("clicks")
> val clicksDF = spark.sql("SELECT * FROM clicks")
> clicksDF.show()
>
> My Spark version is 2.2.1 and Cassandra version is 3.11.1
>
> Thanks in advance
>
>
>
>
>


Re: PySpark - Expand rows into dataframes via function

2017-10-03 Thread Sathish Kumaran Vairavelu
flatMap works too. The explode function is the SQL/DataFrame way of doing a
one-to-many operation. Both should work. Thanks
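
A small illustration of both routes (in Scala for brevity; the same explode,
array and struct functions exist in pyspark.sql.functions, and the two structs
below are placeholders for the real per-/24 expansion). Assumes a spark-shell
session so that spark and its implicits are in scope:

import org.apache.spark.sql.functions.{array, col, explode, struct}
import spark.implicits._

val df = Seq(("23.239.160.0", "ff26920a408f15613096aa7fe0ddaa57")).toDF("start_ip", "hashkey")

// DataFrame route: pack the expanded values into an array of structs, then explode into rows.
val exploded = df
  .select(explode(array(
    struct(col("start_ip").as("ip"), col("hashkey")),
    struct(col("start_ip").as("ip"), col("hashkey"))   // placeholder: one struct per /24 subnet
  )).as("r"))
  .select("r.ip", "r.hashkey")
exploded.show(false)

// RDD route: flatMap each input row into many output rows, as in the quoted solution below.
val flat = df.rdd.flatMap(row => Seq((row.getString(0), row.getString(1))))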
On Tue, Oct 3, 2017 at 8:30 AM Patrick McCarthy <pmccar...@dstillery.com>
wrote:

> Thanks Sathish.
>
> Before you responded, I came up with this solution:
>
> # A function to take in one row and return the expanded ranges:
> def processRow(x):
>
> ...
> return zip(list_of_ip_ranges, list_of_registry_ids)
>
> # and then in spark,
>
> processed_rdds = spark_df_of_input_data.rdd.flatMap(lambda x:
> processRow(x))
>
> processed_df =
> (processed_rdds.toDF().withColumnRenamed('_1','ip').withColumnRenamed('_2','registryid'))
>
> And then after that I split and subset the IP column into what I wanted.
>
> On Mon, Oct 2, 2017 at 7:52 PM, Sathish Kumaran Vairavelu <
> vsathishkuma...@gmail.com> wrote:
>
>> It's possible with array function combined with struct construct. Below
>> is a SQL example
>>
>> select Array(struct(ip1,hashkey), struct(ip2,hashkey))
>> from (select substr(col1,1,2) as ip1, substr(col1,3,3) as ip2, etc,
>> hashkey from object) a
>>
>> If you want dynamic ip ranges; you need to dynamically construct structs
>> based on the range values. Hope this helps.
>>
>>
>> Thanks
>>
>> Sathish
>>
>> On Mon, Oct 2, 2017 at 9:01 AM Patrick McCarthy <pmccar...@dstillery.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I'm trying to map ARIN registry files into more explicit IP ranges. They
>>> provide a number of IPs in the range (here it's 8192) and a starting IP,
>>> and I'm trying to map it into all the included /24 subnets. For example,
>>>
>>> Input:
>>>
>>> array(['arin', 'US', 'ipv4', '23.239.160.0', 8192, 20131104.0, 'allocated',
>>>
>>>'ff26920a408f15613096aa7fe0ddaa57'], dtype=object)
>>>
>>>
>>> Output:
>>>
>>> array([['23', '239', '160', 'ff26920a408f15613096aa7fe0ddaa57'],
>>>['23', '239', '161', 'ff26920a408f15613096aa7fe0ddaa57'],
>>>['23', '239', '162', 'ff26920a408f15613096aa7fe0ddaa57'],
>>>
>>> ...
>>>
>>>
>>> I have the input lookup table in a pyspark DF, and a python function to do 
>>> the conversion into the mapped output. I think to produce the full mapping 
>>> I need a UDTF but this concept doesn't seem to exist in PySpark. What's the 
>>> best approach to do this mapping and recombine into a new DataFrame?
>>>
>>>
>>> Thanks,
>>>
>>> Patrick
>>>
>>>
>


Re: PySpark - Expand rows into dataframes via function

2017-10-02 Thread Sathish Kumaran Vairavelu
It's possible with the array function combined with the struct construct. Below is
a SQL example:

select Array(struct(ip1,hashkey), struct(ip2,hashkey))
from (select substr(col1,1,2) as ip1, substr(col1,3,3) as ip2, etc, hashkey
from object) a

If you want dynamic IP ranges, you need to dynamically construct the structs
based on the range values. Hope this helps.
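
To turn that packed array back into one row per (ip, hashkey) pair, a sketch using
explode (spark-shell Scala wrapping the SQL; the source table "object" and its
columns are taken from the example above, and named_struct is used so both array
elements get identical field names):

val packed = spark.sql("""
  SELECT array(named_struct('ip', ip1, 'hashkey', hashkey),
               named_struct('ip', ip2, 'hashkey', hashkey)) AS ips
  FROM (SELECT substr(col1, 1, 2) AS ip1, substr(col1, 3, 3) AS ip2, hashkey FROM object) a
""")

// explode() flattens the array into one row per struct.
packed.selectExpr("explode(ips) AS r").select("r.ip", "r.hashkey").show(false)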


Thanks

Sathish

On Mon, Oct 2, 2017 at 9:01 AM Patrick McCarthy 
wrote:

> Hello,
>
> I'm trying to map ARIN registry files into more explicit IP ranges. They
> provide a number of IPs in the range (here it's 8192) and a starting IP,
> and I'm trying to map it into all the included /24 subnets. For example,
>
> Input:
>
> array(['arin', 'US', 'ipv4', '23.239.160.0', 8192, 20131104.0, 'allocated',
>
>'ff26920a408f15613096aa7fe0ddaa57'], dtype=object)
>
>
> Output:
>
> array([['23', '239', '160', 'ff26920a408f15613096aa7fe0ddaa57'],
>['23', '239', '161', 'ff26920a408f15613096aa7fe0ddaa57'],
>['23', '239', '162', 'ff26920a408f15613096aa7fe0ddaa57'],
>
> ...
>
>
> I have the input lookup table in a pyspark DF, and a python function to do 
> the conversion into the mapped output. I think to produce the full mapping I 
> need a UDTF but this concept doesn't seem to exist in PySpark. What's the 
> best approach to do this mapping and recombine into a new DataFrame?
>
>
> Thanks,
>
> Patrick
>
>


Re: spark.write.csv is not able to write files to the specified path, but is writing to an unintended _temporary/0/task_xxx subfolder on worker nodes

2017-08-11 Thread Sathish Kumaran Vairavelu
I think you can collect the results in the driver through the RDD's toLocalIterator
method and save them from the driver program, rather than writing files to the
local disk on the workers and collecting them separately. If your data is small
enough and you have enough cores/memory, try processing everything in local mode
and writing the results locally.
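
A minimal sketch of that route (driver-side Scala; myDataFrame is the DataFrame
being written in the thread below, and /tmp/result.csv is a hypothetical local
path on the driver). toLocalIterator streams one partition at a time to the
driver, so the driver never has to hold the whole result the way collect() does:

import java.io.PrintWriter

val out = new PrintWriter("/tmp/result.csv")
try {
  myDataFrame.rdd.toLocalIterator.foreach { row =>
    out.println(row.mkString(","))   // naive CSV: no quoting/escaping
  }
} finally {
  out.close()
}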

-Sathish

On Fri, Aug 11, 2017 at 1:17 PM Steve Loughran 
wrote:

> On 10 Aug 2017, at 09:51, Hemanth Gudela 
> wrote:
>
> Yeah, installing HDFS in our environment is unfornutately going to take
> lot of time (approvals/planning etc). I will have to live with local FS for
> now.
> The other option I had already tried is collect() and send everything to
> driver node. But my data volume is too huge for driver node to handle alone.
>
>
> NFS cross mount.
>
>
> I’m now trying to split the data into multiple datasets, then collect
> individual dataset and write it to local FS on driver node (this approach
> slows down the spark job, but I hope it works).
>
>
>
> I doubt it. The job driver is in charge of committing work renaming data
> under _temporary into the right place. Every operation which calls write()
> to safe to an FS must have the same paths visible to all nodes in the spark
> cluster.
>
> A cluster-wide filesystem of some form is mandatory, or you abandon
> write() and implement your own operations to save (partitioned) data
>
>
> Thank you,
> Hemanth
>
> *From: *Femi Anthony 
> *Date: *Thursday, 10 August 2017 at 11.24
> *To: *Hemanth Gudela 
> *Cc: *"user@spark.apache.org" 
> *Subject: *Re: spark.write.csv is not able write files to specified path,
> but is writing to unintended subfolder _temporary/0/task_xxx folder on
> worker nodes
>
> Also, why are you trying to write results locally if you're not using a
> distributed file system ? Spark is geared towards writing to a distributed
> file system. I would suggest trying to collect() so the data is sent to the
> master and then do a write if the result set isn't too big, or repartition
> before trying to write (though I suspect this won't really help). You
> really should install HDFS if that is possible.
>
> Sent from my iPhone
>
>
> On Aug 10, 2017, at 3:58 AM, Hemanth Gudela 
> wrote:
>
> Thanks for reply Femi!
>
> I’m writing the file like this à myDataFrame.
> write.mode("overwrite").csv("myFilePath")
> There absolutely are no errors/warnings after the write.
>
> _SUCCESS file is created on master node, but the problem of _temporary is
> noticed only on worked nodes.
>
> I know spark.write.csv works best with HDFS, but with the current setup I
> have in my environment, I have to deal with spark write to node’s local
> file system and not to HDFS.
>
> Regards,
> Hemanth
>
> *From: *Femi Anthony 
> *Date: *Thursday, 10 August 2017 at 10.38
> *To: *Hemanth Gudela 
> *Cc: *"user@spark.apache.org" 
> *Subject: *Re: spark.write.csv is not able write files to specified path,
> but is writing to unintended subfolder _temporary/0/task_xxx folder on
> worker nodes
>
> Normally the* _temporary* directory gets deleted as part of the cleanup
> when the write is complete and a SUCCESS file is created. I suspect that
> the writes are not properly completed. How are you specifying the write ?
> Any error messages in the logs ?
>
> On Thu, Aug 10, 2017 at 3:17 AM, Hemanth Gudela <
> hemanth.gud...@qvantel.com> wrote:
>
> Hi,
>
> I’m running spark on cluster mode containing 4 nodes, and trying to write
> CSV files to node’s local path (*not HDFS*).
> I’m spark.write.csv to write CSV files.
>
> *On master node*:
> spark.write.csv creates a folder with csv file name and writes many files
> with part-r-000n suffix. This is okay for me, I can merge them later.
> *But on worker nodes*:
> spark.write.csv creates a folder with csv file name and
> writes many folders and files under _temporary/0/. This is not okay for me.
> Could someone please suggest me what could have been going wrong in my
> settings/how to be able to write csv files to the specified folder, and not
> to subfolders (_temporary/0/task_xxx) in worker machines.
>
> Thank you,
> Hemanth
>
>
>
>
>
> --
> http://www.femibyte.com/twiki5/bin/view/Tech/
> http://www.nextmatrix.com
> "Great spirits have always encountered violent opposition from mediocre
> minds." - Albert Einstein.
>
>


Re: Does Spark SQL use Calcite?

2017-08-10 Thread Sathish Kumaran Vairavelu
I think it comes in as a Hive dependency.
On Thu, Aug 10, 2017 at 4:14 PM kant kodali <kanth...@gmail.com> wrote:

> Since I see a calcite dependency in Spark I wonder where Calcite is being
> used?
>
> On Thu, Aug 10, 2017 at 1:30 PM, Sathish Kumaran Vairavelu <
> vsathishkuma...@gmail.com> wrote:
>
>> Spark SQL doesn't use Calcite
>>
>> On Thu, Aug 10, 2017 at 3:14 PM kant kodali <kanth...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> Does Spark SQL uses Calcite? If so, what for? I thought the Spark SQL
>>> has catalyst which would generate its own logical plans, physical plans and
>>> other optimizations.
>>>
>>> Thanks,
>>> Kant
>>>
>>
>


Re: Does Spark SQL use Calcite?

2017-08-10 Thread Sathish Kumaran Vairavelu
Spark SQL doesn't use Calcite
On Thu, Aug 10, 2017 at 3:14 PM kant kodali  wrote:

> Hi All,
>
> Does Spark SQL uses Calcite? If so, what for? I thought the Spark SQL has
> catalyst which would generate its own logical plans, physical plans and
> other optimizations.
>
> Thanks,
> Kant
>


Re: Spark Streaming: Async action scheduling inside foreachRDD

2017-08-04 Thread Sathish Kumaran Vairavelu
A ForkJoinPool with task support would help in this case: you can create a thread
pool with a configured number of threads (make sure you have enough cores) and
submit the jobs, that is, the actions, to the pool.
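
A sketch of that suggestion applied to the foreachRDD pattern quoted below
(Scala 2.11 imports, as shipped with Spark 2.x at the time; kafkaDStream and the
save calls are placeholders taken from the quoted example):

import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

kafkaDStream.foreachRDD { rdd =>
  val rdd1 = rdd.map(identity)   // your transformations
  val rdd2 = rdd1.map(identity)

  // A dedicated pool sized to the number of actions you want in flight at once.
  val actions = Seq(
    () => rdd1.foreachPartition(_ => ()),   // placeholder: save to Cassandra
    () => rdd2.foreachPartition(_ => ())    // placeholder: save to Kafka
  ).par
  actions.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(2))

  // foreach blocks until every submitted action finishes (and rethrows on failure),
  // so the offset commit below only runs when both saves succeeded.
  actions.foreach(_.apply())

  // commit Kafka offsets here
}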
On Fri, Aug 4, 2017 at 8:54 AM Raghavendra Pandey <
raghavendra.pan...@gmail.com> wrote:

> Did you try SparkContext.addSparkListener?
>
>
>
> On Aug 3, 2017 1:54 AM, "Andrii Biletskyi"
>  wrote:
>
>> Hi all,
>>
>> What is the correct way to schedule multiple jobs inside foreachRDD
>> method and importantly await on result to ensure those jobs have completed
>> successfully?
>> E.g.:
>>
>> kafkaDStream.foreachRDD{ rdd =>
>> val rdd1 = rdd.map(...)
>> val rdd2 = rdd1.map(...)
>>
>> val job1Future = Future{
>> rdd1.saveToCassandra(...)
>> }
>>
>> val job2Future = Future{
>> rdd1.foreachPartition( iter => /* save to Kafka */)
>> }
>>
>>   Await.result(
>>   Future.sequence(job1Future, job2Future),
>>   Duration.Inf)
>>
>>
>>// commit Kafka offsets
>> }
>>
>> In this code I'm scheduling two actions in futures and awaiting them. I
>> need to be sure when I commit Kafka offsets at the end of the batch
>> processing that job1 and job2 have actually executed successfully. Does
>> given approach provide these guarantees? I.e. in case one of the jobs fails
>> the entire batch will be marked as failed too?
>>
>>
>> Thanks,
>> Andrii
>>
>


Re: some Ideas on expressing Spark SQL using JSON

2017-07-26 Thread Sathish Kumaran Vairavelu
Agreed. For the same reason there are DataFrames/Datasets, which form another DSL
used in Spark.
On Wed, Jul 26, 2017 at 1:00 AM Georg Heiler <georg.kf.hei...@gmail.com>
wrote:

> Because sparks dsl partially supports compile time type safety. E.g. the
> compiler will notify you that a sql function was misspelled when using the
> dsl opposed to the plain sql string which is only parsed at runtime.
> Sathish Kumaran Vairavelu <vsathishkuma...@gmail.com> schrieb am Di. 25.
> Juli 2017 um 23:42:
>
>> Just a thought. SQL itself is a DSL. Why DSL on top of another DSL?
>> On Tue, Jul 25, 2017 at 4:47 AM kant kodali <kanth...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I am thinking to express Spark SQL using JSON in the following the way.
>>>
>>> For Example:
>>>
>>> *Query using Spark DSL*
>>>
>>> DS.filter(col("name").equalTo("john"))
>>> .groupBy(functions.window(df1.col("TIMESTAMP"), "24 hours", "24 
>>> hours"), df1.col("hourlyPay"))
>>> .agg(sum("hourlyPay").as("total"));
>>>
>>>
>>> *Query using JSON*
>>>
>>>
>>>
>>> The Goal is to design a DSL in JSON such that users can and express
>>> SPARK SQL queries in JSON so users can send Spark SQL queries over rest and
>>> get the results out. Now, I am sure there are BI tools and notebooks like
>>> Zeppelin that can accomplish the desired behavior however I believe there
>>> maybe group of users who don't want to use those BI tools or notebooks
>>> instead they want all the communication from front end to back end using
>>> API's.
>>>
>>> Also another goal would be the DSL design in JSON should closely mimic
>>> the underlying Spark SQL DSL.
>>>
>>> Please feel free to provide some feedback or criticize to whatever
>>> extent you like!
>>>
>>> Thanks!
>>>
>>>
>>>


Re: some Ideas on expressing Spark SQL using JSON

2017-07-25 Thread Sathish Kumaran Vairavelu
Just a thought: SQL itself is a DSL. Why build a DSL on top of another DSL?
On Tue, Jul 25, 2017 at 4:47 AM kant kodali  wrote:

> Hi All,
>
> I am thinking to express Spark SQL using JSON in the following the way.
>
> For Example:
>
> *Query using Spark DSL*
>
> DS.filter(col("name").equalTo("john"))
> .groupBy(functions.window(df1.col("TIMESTAMP"), "24 hours", "24 
> hours"), df1.col("hourlyPay"))
> .agg(sum("hourlyPay").as("total"));
>
>
> *Query using JSON*
>
>
>
> The Goal is to design a DSL in JSON such that users can and express SPARK
> SQL queries in JSON so users can send Spark SQL queries over rest and get
> the results out. Now, I am sure there are BI tools and notebooks like
> Zeppelin that can accomplish the desired behavior however I believe there
> maybe group of users who don't want to use those BI tools or notebooks
> instead they want all the communication from front end to back end using
> API's.
>
> Also another goal would be the DSL design in JSON should closely mimic the
> underlying Spark SQL DSL.
>
> Please feel free to provide some feedback or criticize to whatever extent
> you like!
>
> Thanks!
>
>
>


Re: Spark SQL 2.1 Complex SQL - Query Planning Issue

2017-04-02 Thread Sathish Kumaran Vairavelu
Please let me know if anybody has any thoughts on this issue.

On Thu, Mar 30, 2017 at 10:37 PM Sathish Kumaran Vairavelu <
vsathishkuma...@gmail.com> wrote:

> Also, is it possible to cache logical plan and parsed query so that in
> subsequent executions it can be reused. It would improve overall query
> performance particularly in streaming jobs
> On Thu, Mar 30, 2017 at 10:06 PM Sathish Kumaran Vairavelu <
> vsathishkuma...@gmail.com> wrote:
>
> Hi Ayan,
>
> I have searched Spark configuration options but couldn't find one to pin
> execution plans in memory. Can you please help?
>
>
> Thanks
>
> Sathish
>
> On Thu, Mar 30, 2017 at 9:30 PM ayan guha <guha.a...@gmail.com> wrote:
>
> I think there is an option of pinning execution plans in memory to avoid
> such scenarios
>
> On Fri, Mar 31, 2017 at 1:25 PM, Sathish Kumaran Vairavelu <
> vsathishkuma...@gmail.com> wrote:
>
> Hi Everyone,
>
> I have complex SQL with approx 2000 lines of code and works with 50+
> tables with 50+ left joins and transformations. All the tables are fully
> cached in Memory with sufficient storage memory and working memory. The
> issue is after the launch of the query for the execution; the query takes
> approximately 40 seconds to appear in the Jobs/SQL in the application UI.
>
> While the execution takes only 25 seconds; the execution is delayed by 40
> seconds by the scheduler so the total runtime of the query becomes 65
> seconds(40s + 25s). Also, there are enough cores available during this wait
> time. I couldn't figure out why DAG scheduler is delaying the execution by
> 40 seconds. Is this due to time taken for Query Parsing and Query Planning
> for the Complex SQL? If thats the case; how do we optimize this Query
> Parsing and Query Planning time in Spark? Any help would be helpful.
>
>
> Thanks
>
> Sathish
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
>


Re: Spark SQL 2.1 Complex SQL - Query Planning Issue

2017-03-30 Thread Sathish Kumaran Vairavelu
Also, is it possible to cache the logical plan and the parsed query so that they
can be reused in subsequent executions? It would improve overall query
performance, particularly in streaming jobs.
On Thu, Mar 30, 2017 at 10:06 PM Sathish Kumaran Vairavelu <
vsathishkuma...@gmail.com> wrote:

> Hi Ayan,
>
> I have searched Spark configuration options but couldn't find one to pin
> execution plans in memory. Can you please help?
>
>
> Thanks
>
> Sathish
>
> On Thu, Mar 30, 2017 at 9:30 PM ayan guha <guha.a...@gmail.com> wrote:
>
> I think there is an option of pinning execution plans in memory to avoid
> such scenarios....
>
> On Fri, Mar 31, 2017 at 1:25 PM, Sathish Kumaran Vairavelu <
> vsathishkuma...@gmail.com> wrote:
>
> Hi Everyone,
>
> I have complex SQL with approx 2000 lines of code and works with 50+
> tables with 50+ left joins and transformations. All the tables are fully
> cached in Memory with sufficient storage memory and working memory. The
> issue is after the launch of the query for the execution; the query takes
> approximately 40 seconds to appear in the Jobs/SQL in the application UI.
>
> While the execution takes only 25 seconds; the execution is delayed by 40
> seconds by the scheduler so the total runtime of the query becomes 65
> seconds(40s + 25s). Also, there are enough cores available during this wait
> time. I couldn't figure out why DAG scheduler is delaying the execution by
> 40 seconds. Is this due to time taken for Query Parsing and Query Planning
> for the Complex SQL? If thats the case; how do we optimize this Query
> Parsing and Query Planning time in Spark? Any help would be helpful.
>
>
> Thanks
>
> Sathish
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
>


Re: Spark SQL 2.1 Complex SQL - Query Planning Issue

2017-03-30 Thread Sathish Kumaran Vairavelu
Hi Ayan,

I have searched Spark configuration options but couldn't find one to pin
execution plans in memory. Can you please help?


Thanks

Sathish

On Thu, Mar 30, 2017 at 9:30 PM ayan guha <guha.a...@gmail.com> wrote:

> I think there is an option of pinning execution plans in memory to avoid
> such scenarios
>
> On Fri, Mar 31, 2017 at 1:25 PM, Sathish Kumaran Vairavelu <
> vsathishkuma...@gmail.com> wrote:
>
> Hi Everyone,
>
> I have complex SQL with approx 2000 lines of code and works with 50+
> tables with 50+ left joins and transformations. All the tables are fully
> cached in Memory with sufficient storage memory and working memory. The
> issue is after the launch of the query for the execution; the query takes
> approximately 40 seconds to appear in the Jobs/SQL in the application UI.
>
> While the execution takes only 25 seconds; the execution is delayed by 40
> seconds by the scheduler so the total runtime of the query becomes 65
> seconds(40s + 25s). Also, there are enough cores available during this wait
> time. I couldn't figure out why DAG scheduler is delaying the execution by
> 40 seconds. Is this due to time taken for Query Parsing and Query Planning
> for the Complex SQL? If thats the case; how do we optimize this Query
> Parsing and Query Planning time in Spark? Any help would be helpful.
>
>
> Thanks
>
> Sathish
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>


Spark SQL 2.1 Complex SQL - Query Planning Issue

2017-03-30 Thread Sathish Kumaran Vairavelu
Hi Everyone,

I have a complex SQL query of approximately 2000 lines that works with 50+ tables,
with 50+ left joins and transformations. All the tables are fully cached in
memory with sufficient storage memory and working memory. The issue is that
after the query is launched for execution, it takes approximately 40 seconds to
appear under Jobs/SQL in the application UI.

While the execution itself takes only 25 seconds, it is delayed by 40 seconds by
the scheduler, so the total runtime of the query becomes 65 seconds (40s + 25s).
Also, there are enough cores available during this wait time. I couldn't figure
out why the DAG scheduler is delaying the execution by 40 seconds. Is this due to
the time taken for query parsing and query planning for the complex SQL? If
that's the case, how do we optimize this query parsing and query planning time in
Spark? Any help would be appreciated.


Thanks

Sathish


Re: Spark Job trigger in production

2016-07-20 Thread Sathish Kumaran Vairavelu
If you are using Mesos, then you can use Chronos or Marathon.
On Wed, Jul 20, 2016 at 6:08 AM Rabin Banerjee 
wrote:

> ++ crontab :)
>
> On Wed, Jul 20, 2016 at 9:07 AM, Andrew Ehrlich 
> wrote:
>
>> Another option is Oozie with the spark action:
>> https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html
>>
>> On Jul 18, 2016, at 12:15 AM, Jagat Singh  wrote:
>>
>> You can use following options
>>
>> * spark-submit from shell
>> * some kind of job server. See spark-jobserver for details
>> * some notebook environment See Zeppelin for example
>>
>>
>>
>>
>>
>> On 18 July 2016 at 17:13, manish jaiswal  wrote:
>>
>>> Hi,
>>>
>>>
>>> What is the best approach to trigger spark job in production cluster?
>>>
>>
>>
>>
>


Re: Best practice for handing tables between pipeline components

2016-06-27 Thread Sathish Kumaran Vairavelu
Alluxio off-heap memory would help to share cached objects across components.
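
A sketch of the idea (assuming the Alluxio client jar is on the Spark classpath;
the master address alluxio://alluxio-master:19998 and the DataFrame name
intermediateDf are hypothetical):

// Component A materializes the intermediate table into Alluxio once...
intermediateDf.write.mode("overwrite").parquet("alluxio://alluxio-master:19998/pipeline/stage1")

// ...and component B (a separate Spark job, or anything that reads Parquet through
// the Hadoop FileSystem API) picks it up from Alluxio's memory tier instead of S3/HDFS.
val stage1 = spark.read.parquet("alluxio://alluxio-master:19998/pipeline/stage1")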
On Mon, Jun 27, 2016 at 11:14 AM Everett Anderson 
wrote:

> Hi,
>
> We have a pipeline of components strung together via Airflow running on
> AWS. Some of them are implemented in Spark, but some aren't. Generally they
> can all talk to a JDBC/ODBC end point or read/write files from S3.
>
> Ideally, we wouldn't suffer the I/O cost of writing all the data to HDFS
> or S3 and reading it back in, again, in every component, if it could stay
> cached in memory in a Spark cluster.
>
> Our current investigation seems to lead us towards exploring if the
> following things are possible:
>
>- Using a Hive metastore with S3 as its backing data store to try to
>keep a mapping from table name to files on S3 (not sure if one can cache a
>Hive table in Spark across contexts, though)
>- Using something like the spark-jobserver to keep a Spark SQLContext
>open across Spark components so they could avoid file I/O for cached tables
>
> What's the best practice for handing tables between Spark programs? What
> about between Spark and non-Spark programs?
>
> Thanks!
>
> - Everett
>
>


Re: Spark 1.5 on Mesos

2016-03-02 Thread Sathish Kumaran Vairavelu
Try passing the jar using the --jars option.
On Wed, Mar 2, 2016 at 10:17 AM Ashish Soni <asoni.le...@gmail.com> wrote:

> I made some progress but now i am stuck at this point , Please help as
> looks like i am close to get it working
>
> I have everything running in docker container including mesos slave and
> master
>
> When i try to submit the pi example i get below error
> *Error: Cannot load main class from JAR file:/opt/spark/Example*
>
> Below is the command i use to submit as a docker container
>
> docker run -it --rm -e SPARK_MASTER="mesos://10.0.2.15:7077"  -e
> SPARK_IMAGE="spark_driver:latest" spark_driver:latest ./bin/spark-submit
> --deploy-mode cluster --name "PI Example" --class
> org.apache.spark.examples.SparkPi --driver-memory 512m --executor-memory
> 512m --executor-cores 1
> http://10.0.2.15/spark-examples-1.6.0-hadoop2.6.0.jar
>
>
> On Tue, Mar 1, 2016 at 2:59 PM, Timothy Chen <t...@mesosphere.io> wrote:
>
>> Can you go through the Mesos UI and look at the driver/executor log from
>> steer file and see what the problem is?
>>
>> Tim
>>
>> On Mar 1, 2016, at 8:05 AM, Ashish Soni <asoni.le...@gmail.com> wrote:
>>
>> Not sure what is the issue but i am getting below error  when i try to
>> run spark PI example
>>
>> Blacklisting Mesos slave value: "5345asdasdasdkas234234asdasdasdasd"
>>due to too many failures; is Spark installed on it?
>> WARN TaskSchedulerImpl: Initial job has not accepted any resources; 
>> check your cluster UI to ensure that workers are registered and have 
>> sufficient resources
>>
>>
>> On Mon, Feb 29, 2016 at 1:39 PM, Sathish Kumaran Vairavelu <
>> vsathishkuma...@gmail.com> wrote:
>>
>>> May be the Mesos executor couldn't find spark image or the constraints
>>> are not satisfied. Check your Mesos UI if you see Spark application in the
>>> Frameworks tab
>>>
>>> On Mon, Feb 29, 2016 at 12:23 PM Ashish Soni <asoni.le...@gmail.com>
>>> wrote:
>>>
>>>> What is the Best practice , I have everything running as docker
>>>> container in single host ( mesos and marathon also as docker container )
>>>>  and everything comes up fine but when i try to launch the spark shell i
>>>> get below error
>>>>
>>>>
>>>> SQL context available as sqlContext.
>>>>
>>>> scala> val data = sc.parallelize(1 to 100)
>>>> data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at
>>>> parallelize at :27
>>>>
>>>> scala> data.count
>>>> [Stage 0:>  (0
>>>> + 0) / 2]16/02/29 18:21:12 WARN TaskSchedulerImpl: Initial job has not
>>>> accepted any resources; check your cluster UI to ensure that workers are
>>>> registered and have sufficient resources
>>>> 16/02/29 18:21:27 WARN TaskSchedulerImpl: Initial job has not accepted
>>>> any resources; check your cluster UI to ensure that workers are registered
>>>> and have sufficient resources
>>>>
>>>>
>>>>
>>>> On Mon, Feb 29, 2016 at 12:04 PM, Tim Chen <t...@mesosphere.io> wrote:
>>>>
>>>>> No you don't have to run Mesos in docker containers to run Spark in
>>>>> docker containers.
>>>>>
>>>>> Once you have Mesos cluster running you can then specfiy the Spark
>>>>> configurations in your Spark job (i.e: 
>>>>> spark.mesos.executor.docker.image=mesosphere/spark:1.6)
>>>>> and Mesos will automatically launch docker containers for you.
>>>>>
>>>>> Tim
>>>>>
>>>>> On Mon, Feb 29, 2016 at 7:36 AM, Ashish Soni <asoni.le...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Yes i read that and not much details here.
>>>>>>
>>>>>> Is it true that we need to have spark installed on each mesos docker
>>>>>> container ( master and slave ) ...
>>>>>>
>>>>>> Ashish
>>>>>>
>>>>>> On Fri, Feb 26, 2016 at 2:14 PM, Tim Chen <t...@mesosphere.io> wrote:
>>>>>>
>>>>>>> https://spark.apache.org/docs/latest/running-on-mesos.html should
>>>>>>> be the best source, what problems were you running into?
>>>>>>>
>>>>>>> Tim
>>>>>>>
>>>>>>> On Fri, Feb 26, 2016 at 11:06 AM, Yin Yang <yy201...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Have you read this ?
>>>>>>>> https://spark.apache.org/docs/latest/running-on-mesos.html
>>>>>>>>
>>>>>>>> On Fri, Feb 26, 2016 at 11:03 AM, Ashish Soni <
>>>>>>>> asoni.le...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi All ,
>>>>>>>>>
>>>>>>>>> Is there any proper documentation as how to run spark on mesos , I
>>>>>>>>> am trying from the last few days and not able to make it work.
>>>>>>>>>
>>>>>>>>> Please help
>>>>>>>>>
>>>>>>>>> Ashish
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>
>


Re: Passing multiple jar files to spark-shell

2016-02-14 Thread Sathish Kumaran Vairavelu
--jars takes comma-separated values, for example --jars /home/hduser/jars/jconn4.jar,/home/hduser/jars/ojdbc6.jar; pass --driver-class-path once, as a single ':'-separated classpath.

On Sun, Feb 14, 2016 at 5:35 PM Mich Talebzadeh  wrote:

> Hi,
>
>
>
> Is there anyway one can pass multiple --driver-class-path and multiple
> –jars to spark shell.
>
>
>
> For example something as below with two jar files entries for Oracle
> (ojdbc6.jar) and Sybase IQ (jcoon4,jar)
>
>
>
> spark-shell --master spark://50.140.197.217:7077 --driver-class-path
> /home/hduser/jars/jconn4.jar  --driver-class-path
> /home/hduser/jars/ojdbc6.jar --jars /home/hduser/jars/jconn4.jar --jars
> /home/hduser/jars/ojdbc6.jar
>
>
>
>
>
> This works for one jar file only and you need to add the jar file to both 
> --driver-class-path
> and –jars. I have not managed to work for more than one type of JAR file
>
>
>
> Thanks,
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
>
>
>


Re: Spark, Mesos, Docker and S3

2016-01-29 Thread Sathish Kumaran Vairavelu
Hi

Quick question. How do I pass the constraint [["hostname", "CLUSTER", "
specific.node.com"]] to Mesos?

I was trying --conf spark.mesos.constraints=hostname:specific.node.com, but it
didn't seem to work.


Please help


Thanks

Sathish
On Thu, Jan 28, 2016 at 6:52 PM Mao Geng <m...@sumologic.com> wrote:

> From my limited knowledge, only limited options such as network mode,
> volumes, portmaps can be passed through. See
> https://github.com/apache/spark/pull/3074/files.
>
> https://issues.apache.org/jira/browse/SPARK-8734 is open for exposing all
> docker options to spark.
>
> -Mao
>
> On Thu, Jan 28, 2016 at 1:55 PM, Sathish Kumaran Vairavelu <
> vsathishkuma...@gmail.com> wrote:
>
>> Thank you., I figured it out. I have set executor memory to minimal and
>> it works.,
>>
>> Another issue has come.. I have to pass --add-host option while running
>> containers in slave nodes.. Is there any option to pass docker run
>> parameters from spark?
>> On Thu, Jan 28, 2016 at 12:26 PM Mao Geng <m...@sumologic.com> wrote:
>>
>>> Sathish,
>>>
>>> I guess the mesos resources are not enough to run your job. You might
>>> want to check the mesos log to figure out why.
>>>
>>> I tried to run the docker image with "--conf spark.mesos.coarse=false"
>>> and "true". Both are fine.
>>>
>>> Best,
>>> Mao
>>>
>>> On Wed, Jan 27, 2016 at 5:00 PM, Sathish Kumaran Vairavelu <
>>> vsathishkuma...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> On the same Spark/Mesos/Docker setup, I am getting warning "Initial Job
>>>> has not accepted any resources; check your cluster UI to ensure that
>>>> workers are registered and have sufficient resources". I am running in
>>>> coarse grained mode. Any pointers on how to fix this issue? Please help. I
>>>> have updated both docker.properties and spark-default.conf with  
>>>> spark.mesos.executor.docker.image
>>>> and other properties.
>>>>
>>>>
>>>> Thanks
>>>>
>>>> Sathish
>>>>
>>>> On Wed, Jan 27, 2016 at 9:58 AM Sathish Kumaran Vairavelu <
>>>> vsathishkuma...@gmail.com> wrote:
>>>>
>>>>> Thanks a lot for your info! I will try this today.
>>>>> On Wed, Jan 27, 2016 at 9:29 AM Mao Geng <m...@sumologic.com> wrote:
>>>>>
>>>>>> Hi Sathish,
>>>>>>
>>>>>> The docker image is normal, no AWS profile included.
>>>>>>
>>>>>> When the driver container runs with --net=host, the driver host's AWS
>>>>>> profile will take effect so that the driver can access the protected s3
>>>>>> files.
>>>>>>
>>>>>> Similarly,  Mesos slaves also run Spark executor docker container in
>>>>>> --net=host mode, so that the AWS profile of Mesos slaves will take 
>>>>>> effect.
>>>>>>
>>>>>> Hope it helps,
>>>>>> Mao
>>>>>>
>>>>>> On Jan 26, 2016, at 9:15 PM, Sathish Kumaran Vairavelu <
>>>>>> vsathishkuma...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Mao,
>>>>>>
>>>>>> I want to check on accessing the S3 from Spark docker in Mesos.  The
>>>>>> EC2 instance that I am using has the AWS profile/IAM included.  Should we
>>>>>> build the docker image with any AWS profile settings or --net=host docker
>>>>>> option takes care of it?
>>>>>>
>>>>>> Please help
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Sathish
>>>>>>
>>>>>> On Tue, Jan 26, 2016 at 9:04 PM Mao Geng <m...@sumologic.com> wrote:
>>>>>>
>>>>>>> Thank you very much, Jerry!
>>>>>>>
>>>>>>> I changed to "--jars
>>>>>>> /opt/spark/lib/hadoop-aws-2.7.1.jar,/opt/spark/lib/aws-java-sdk-1.7.4.jar"
>>>>>>> then it worked like a charm!
>>>>>>>
>>>>>>> From Mesos task logs below, I saw Mesos executor downloaded the jars
>>>>>>> from the driver, which is a bit unnecessary (as the docker image already
>>>>>>> has them), but that's

Re: Spark, Mesos, Docker and S3

2016-01-28 Thread Sathish Kumaran Vairavelu
Thank you. I figured it out: I set the executor memory to a minimal value and it
works.

Another issue has come up: I have to pass the --add-host option while running
containers on the slave nodes. Is there any option to pass docker run
parameters from Spark?
On Thu, Jan 28, 2016 at 12:26 PM Mao Geng <m...@sumologic.com> wrote:

> Sathish,
>
> I guess the mesos resources are not enough to run your job. You might want
> to check the mesos log to figure out why.
>
> I tried to run the docker image with "--conf spark.mesos.coarse=false" and
> "true". Both are fine.
>
> Best,
> Mao
>
> On Wed, Jan 27, 2016 at 5:00 PM, Sathish Kumaran Vairavelu <
> vsathishkuma...@gmail.com> wrote:
>
>> Hi,
>>
>> On the same Spark/Mesos/Docker setup, I am getting warning "Initial Job
>> has not accepted any resources; check your cluster UI to ensure that
>> workers are registered and have sufficient resources". I am running in
>> coarse grained mode. Any pointers on how to fix this issue? Please help. I
>> have updated both docker.properties and spark-default.conf with  
>> spark.mesos.executor.docker.image
>> and other properties.
>>
>>
>> Thanks
>>
>> Sathish
>>
>> On Wed, Jan 27, 2016 at 9:58 AM Sathish Kumaran Vairavelu <
>> vsathishkuma...@gmail.com> wrote:
>>
>>> Thanks a lot for your info! I will try this today.
>>> On Wed, Jan 27, 2016 at 9:29 AM Mao Geng <m...@sumologic.com> wrote:
>>>
>>>> Hi Sathish,
>>>>
>>>> The docker image is normal, no AWS profile included.
>>>>
>>>> When the driver container runs with --net=host, the driver host's AWS
>>>> profile will take effect so that the driver can access the protected s3
>>>> files.
>>>>
>>>> Similarly,  Mesos slaves also run Spark executor docker container in
>>>> --net=host mode, so that the AWS profile of Mesos slaves will take effect.
>>>>
>>>> Hope it helps,
>>>> Mao
>>>>
>>>> On Jan 26, 2016, at 9:15 PM, Sathish Kumaran Vairavelu <
>>>> vsathishkuma...@gmail.com> wrote:
>>>>
>>>> Hi Mao,
>>>>
>>>> I want to check on accessing the S3 from Spark docker in Mesos.  The
>>>> EC2 instance that I am using has the AWS profile/IAM included.  Should we
>>>> build the docker image with any AWS profile settings or --net=host docker
>>>> option takes care of it?
>>>>
>>>> Please help
>>>>
>>>>
>>>> Thanks
>>>>
>>>> Sathish
>>>>
>>>> On Tue, Jan 26, 2016 at 9:04 PM Mao Geng <m...@sumologic.com> wrote:
>>>>
>>>>> Thank you very much, Jerry!
>>>>>
>>>>> I changed to "--jars
>>>>> /opt/spark/lib/hadoop-aws-2.7.1.jar,/opt/spark/lib/aws-java-sdk-1.7.4.jar"
>>>>> then it worked like a charm!
>>>>>
>>>>> From Mesos task logs below, I saw Mesos executor downloaded the jars
>>>>> from the driver, which is a bit unnecessary (as the docker image already
>>>>> has them), but that's ok - I am happy seeing Spark + Mesos + Docker + S3
>>>>> worked together!
>>>>>
>>>>> Thanks,
>>>>> Mao
>>>>>
>>>>> 16/01/27 02:54:45 INFO Executor: Using REPL class URI: 
>>>>> http://172.16.3.98:33771
>>>>> 16/01/27 02:55:12 INFO CoarseGrainedExecutorBackend: Got assigned task 0
>>>>> 16/01/27 02:55:12 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
>>>>> 16/01/27 02:55:12 INFO Executor: Fetching 
>>>>> http://172.16.3.98:3850/jars/hadoop-aws-2.7.1.jar with timestamp 
>>>>> 1453863280432
>>>>> 16/01/27 02:55:12 INFO Utils: Fetching 
>>>>> http://172.16.3.98:3850/jars/hadoop-aws-2.7.1.jar to 
>>>>> /tmp/spark-7b8e1681-8a62-4f1d-9e11-fdf8062b1b08/fetchFileTemp1518118694295619525.tmp
>>>>> 16/01/27 02:55:12 INFO Utils: Copying 
>>>>> /tmp/spark-7b8e1681-8a62-4f1d-9e11-fdf8062b1b08/-19880839621453863280432_cache
>>>>>  to /./hadoop-aws-2.7.1.jar
>>>>> 16/01/27 02:55:12 INFO Executor: Adding file:/./hadoop-aws-2.7.1.jar to 
>>>>> class loader
>>>>> 16/01/27 02:55:12 INFO Executor: Fetching 
>>>>> http://172.16.3.98:3850/jars/aws-java-sdk-1.7.4.jar with timestamp 
>>>>> 1453863280472
>>>>> 16/01/27

Re: Spark, Mesos, Docker and S3

2016-01-27 Thread Sathish Kumaran Vairavelu
Hi,

On the same Spark/Mesos/Docker setup, I am getting the warning "Initial job has
not accepted any resources; check your cluster UI to ensure that workers
are registered and have sufficient resources". I am running in coarse-grained
mode. Any pointers on how to fix this issue? Please help. I have updated both
docker.properties and spark-defaults.conf with spark.mesos.executor.docker.image
and other properties.


Thanks

Sathish

On Wed, Jan 27, 2016 at 9:58 AM Sathish Kumaran Vairavelu <
vsathishkuma...@gmail.com> wrote:

> Thanks a lot for your info! I will try this today.
> On Wed, Jan 27, 2016 at 9:29 AM Mao Geng <m...@sumologic.com> wrote:
>
>> Hi Sathish,
>>
>> The docker image is normal, no AWS profile included.
>>
>> When the driver container runs with --net=host, the driver host's AWS
>> profile will take effect so that the driver can access the protected s3
>> files.
>>
>> Similarly,  Mesos slaves also run Spark executor docker container in
>> --net=host mode, so that the AWS profile of Mesos slaves will take effect.
>>
>> Hope it helps,
>> Mao
>>
>> On Jan 26, 2016, at 9:15 PM, Sathish Kumaran Vairavelu <
>> vsathishkuma...@gmail.com> wrote:
>>
>> Hi Mao,
>>
>> I want to check on accessing the S3 from Spark docker in Mesos.  The EC2
>> instance that I am using has the AWS profile/IAM included.  Should we build
>> the docker image with any AWS profile settings or --net=host docker option
>> takes care of it?
>>
>> Please help
>>
>>
>> Thanks
>>
>> Sathish
>>
>> On Tue, Jan 26, 2016 at 9:04 PM Mao Geng <m...@sumologic.com> wrote:
>>
>>> Thank you very much, Jerry!
>>>
>>> I changed to "--jars
>>> /opt/spark/lib/hadoop-aws-2.7.1.jar,/opt/spark/lib/aws-java-sdk-1.7.4.jar"
>>> then it worked like a charm!
>>>
>>> From Mesos task logs below, I saw Mesos executor downloaded the jars
>>> from the driver, which is a bit unnecessary (as the docker image already
>>> has them), but that's ok - I am happy seeing Spark + Mesos + Docker + S3
>>> worked together!
>>>
>>> Thanks,
>>> Mao
>>>
>>> 16/01/27 02:54:45 INFO Executor: Using REPL class URI: 
>>> http://172.16.3.98:33771
>>> 16/01/27 02:55:12 INFO CoarseGrainedExecutorBackend: Got assigned task 0
>>> 16/01/27 02:55:12 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
>>> 16/01/27 02:55:12 INFO Executor: Fetching 
>>> http://172.16.3.98:3850/jars/hadoop-aws-2.7.1.jar with timestamp 
>>> 1453863280432
>>> 16/01/27 02:55:12 INFO Utils: Fetching 
>>> http://172.16.3.98:3850/jars/hadoop-aws-2.7.1.jar to 
>>> /tmp/spark-7b8e1681-8a62-4f1d-9e11-fdf8062b1b08/fetchFileTemp1518118694295619525.tmp
>>> 16/01/27 02:55:12 INFO Utils: Copying 
>>> /tmp/spark-7b8e1681-8a62-4f1d-9e11-fdf8062b1b08/-19880839621453863280432_cache
>>>  to /./hadoop-aws-2.7.1.jar
>>> 16/01/27 02:55:12 INFO Executor: Adding file:/./hadoop-aws-2.7.1.jar to 
>>> class loader
>>> 16/01/27 02:55:12 INFO Executor: Fetching 
>>> http://172.16.3.98:3850/jars/aws-java-sdk-1.7.4.jar with timestamp 
>>> 1453863280472
>>> 16/01/27 02:55:12 INFO Utils: Fetching 
>>> http://172.16.3.98:3850/jars/aws-java-sdk-1.7.4.jar to 
>>> /tmp/spark-7b8e1681-8a62-4f1d-9e11-fdf8062b1b08/fetchFileTemp8868621397726761921.tmp
>>> 16/01/27 02:55:12 INFO Utils: Copying 
>>> /tmp/spark-7b8e1681-8a62-4f1d-9e11-fdf8062b1b08/8167072821453863280472_cache
>>>  to /./aws-java-sdk-1.7.4.jar
>>> 16/01/27 02:55:12 INFO Executor: Adding file:/./aws-java-sdk-1.7.4.jar to 
>>> class loader
>>>
>>> On Tue, Jan 26, 2016 at 5:40 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>>
>>>> Hi Mao,
>>>>
>>>> Can you try --jars to include those jars?
>>>>
>>>> Best Regards,
>>>>
>>>> Jerry
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On 26 Jan, 2016, at 7:02 pm, Mao Geng <m...@sumologic.com> wrote:
>>>>
>>>> Hi there,
>>>>
>>>> I am trying to run Spark on Mesos using a Docker image as executor, as
>>>> mentioned
>>>> http://spark.apache.org/docs/latest/running-on-mesos.html#mesos-docker-support
>>>> .
>>>>
>>>> I built a docker image using the following Dockerfile (which is based
>>>> on
>>>> https://github.com/apache/spark/blob/master/docker/spark-mesos/Doc

Re: Spark, Mesos, Docker and S3

2016-01-27 Thread Sathish Kumaran Vairavelu
Thanks a lot for your info! I will try this today.
On Wed, Jan 27, 2016 at 9:29 AM Mao Geng <m...@sumologic.com> wrote:

> Hi Sathish,
>
> The docker image is normal, no AWS profile included.
>
> When the driver container runs with --net=host, the driver host's AWS
> profile will take effect so that the driver can access the protected s3
> files.
>
> Similarly,  Mesos slaves also run Spark executor docker container in
> --net=host mode, so that the AWS profile of Mesos slaves will take effect.
>
> Hope it helps,
> Mao
>
> On Jan 26, 2016, at 9:15 PM, Sathish Kumaran Vairavelu <
> vsathishkuma...@gmail.com> wrote:
>
> Hi Mao,
>
> I want to check on accessing the S3 from Spark docker in Mesos.  The EC2
> instance that I am using has the AWS profile/IAM included.  Should we build
> the docker image with any AWS profile settings or --net=host docker option
> takes care of it?
>
> Please help
>
>
> Thanks
>
> Sathish
>
> On Tue, Jan 26, 2016 at 9:04 PM Mao Geng <m...@sumologic.com> wrote:
>
>> Thank you very much, Jerry!
>>
>> I changed to "--jars
>> /opt/spark/lib/hadoop-aws-2.7.1.jar,/opt/spark/lib/aws-java-sdk-1.7.4.jar"
>> then it worked like a charm!
>>
>> From Mesos task logs below, I saw Mesos executor downloaded the jars from
>> the driver, which is a bit unnecessary (as the docker image already has
>> them), but that's ok - I am happy seeing Spark + Mesos + Docker + S3 worked
>> together!
>>
>> Thanks,
>> Mao
>>
>> 16/01/27 02:54:45 INFO Executor: Using REPL class URI: 
>> http://172.16.3.98:33771
>> 16/01/27 02:55:12 INFO CoarseGrainedExecutorBackend: Got assigned task 0
>> 16/01/27 02:55:12 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
>> 16/01/27 02:55:12 INFO Executor: Fetching 
>> http://172.16.3.98:3850/jars/hadoop-aws-2.7.1.jar with timestamp 
>> 1453863280432
>> 16/01/27 02:55:12 INFO Utils: Fetching 
>> http://172.16.3.98:3850/jars/hadoop-aws-2.7.1.jar to 
>> /tmp/spark-7b8e1681-8a62-4f1d-9e11-fdf8062b1b08/fetchFileTemp1518118694295619525.tmp
>> 16/01/27 02:55:12 INFO Utils: Copying 
>> /tmp/spark-7b8e1681-8a62-4f1d-9e11-fdf8062b1b08/-19880839621453863280432_cache
>>  to /./hadoop-aws-2.7.1.jar
>> 16/01/27 02:55:12 INFO Executor: Adding file:/./hadoop-aws-2.7.1.jar to 
>> class loader
>> 16/01/27 02:55:12 INFO Executor: Fetching 
>> http://172.16.3.98:3850/jars/aws-java-sdk-1.7.4.jar with timestamp 
>> 1453863280472
>> 16/01/27 02:55:12 INFO Utils: Fetching 
>> http://172.16.3.98:3850/jars/aws-java-sdk-1.7.4.jar to 
>> /tmp/spark-7b8e1681-8a62-4f1d-9e11-fdf8062b1b08/fetchFileTemp8868621397726761921.tmp
>> 16/01/27 02:55:12 INFO Utils: Copying 
>> /tmp/spark-7b8e1681-8a62-4f1d-9e11-fdf8062b1b08/8167072821453863280472_cache 
>> to /./aws-java-sdk-1.7.4.jar
>> 16/01/27 02:55:12 INFO Executor: Adding file:/./aws-java-sdk-1.7.4.jar to 
>> class loader
>>
>> On Tue, Jan 26, 2016 at 5:40 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>>> Hi Mao,
>>>
>>> Can you try --jars to include those jars?
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>> Sent from my iPhone
>>>
>>> On 26 Jan, 2016, at 7:02 pm, Mao Geng <m...@sumologic.com> wrote:
>>>
>>> Hi there,
>>>
>>> I am trying to run Spark on Mesos using a Docker image as executor, as
>>> mentioned
>>> http://spark.apache.org/docs/latest/running-on-mesos.html#mesos-docker-support
>>> .
>>>
>>> I built a docker image using the following Dockerfile (which is based on
>>> https://github.com/apache/spark/blob/master/docker/spark-mesos/Dockerfile
>>> ):
>>>
>>> FROM mesosphere/mesos:0.25.0-0.2.70.ubuntu1404
>>>
>>> # Update the base ubuntu image with dependencies needed for Spark
>>> RUN apt-get update && \
>>> apt-get install -y python libnss3 openjdk-7-jre-headless curl
>>>
>>> RUN curl
>>> http://www.carfab.com/apachesoftware/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
>>> | tar -xzC /opt && \
>>> ln -s /opt/spark-1.6.0-bin-hadoop2.6 /opt/spark
>>> ENV SPARK_HOME /opt/spark
>>> ENV MESOS_NATIVE_JAVA_LIBRARY /usr/local/lib/libmesos.so
>>>
>>> Then I successfully ran spark-shell via this docker command:
>>> docker run --rm -it --net=host /:
>>> /opt/spark/bin/spark-shell --master mesos://:5050 --conf
>>> /:
>>>
>>> So far so good. Then I wanted to call sc.textFile

Re: Spark, Mesos, Docker and S3

2016-01-26 Thread Sathish Kumaran Vairavelu
Hi Mao,

I want to check on accessing S3 from the Spark docker image on Mesos.  The EC2
instance that I am using has the AWS profile/IAM role included.  Should we build
the docker image with any AWS profile settings, or does the --net=host docker
option take care of it?

Please help


Thanks

Sathish

On Tue, Jan 26, 2016 at 9:04 PM Mao Geng  wrote:

> Thank you very much, Jerry!
>
> I changed to "--jars
> /opt/spark/lib/hadoop-aws-2.7.1.jar,/opt/spark/lib/aws-java-sdk-1.7.4.jar"
> then it worked like a charm!
>
> From Mesos task logs below, I saw Mesos executor downloaded the jars from
> the driver, which is a bit unnecessary (as the docker image already has
> them), but that's ok - I am happy seeing Spark + Mesos + Docker + S3 worked
> together!
>
> Thanks,
> Mao
>
> 16/01/27 02:54:45 INFO Executor: Using REPL class URI: 
> http://172.16.3.98:33771
> 16/01/27 02:55:12 INFO CoarseGrainedExecutorBackend: Got assigned task 0
> 16/01/27 02:55:12 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 16/01/27 02:55:12 INFO Executor: Fetching 
> http://172.16.3.98:3850/jars/hadoop-aws-2.7.1.jar with timestamp 1453863280432
> 16/01/27 02:55:12 INFO Utils: Fetching 
> http://172.16.3.98:3850/jars/hadoop-aws-2.7.1.jar to 
> /tmp/spark-7b8e1681-8a62-4f1d-9e11-fdf8062b1b08/fetchFileTemp1518118694295619525.tmp
> 16/01/27 02:55:12 INFO Utils: Copying 
> /tmp/spark-7b8e1681-8a62-4f1d-9e11-fdf8062b1b08/-19880839621453863280432_cache
>  to /./hadoop-aws-2.7.1.jar
> 16/01/27 02:55:12 INFO Executor: Adding file:/./hadoop-aws-2.7.1.jar to class 
> loader
> 16/01/27 02:55:12 INFO Executor: Fetching 
> http://172.16.3.98:3850/jars/aws-java-sdk-1.7.4.jar with timestamp 
> 1453863280472
> 16/01/27 02:55:12 INFO Utils: Fetching 
> http://172.16.3.98:3850/jars/aws-java-sdk-1.7.4.jar to 
> /tmp/spark-7b8e1681-8a62-4f1d-9e11-fdf8062b1b08/fetchFileTemp8868621397726761921.tmp
> 16/01/27 02:55:12 INFO Utils: Copying 
> /tmp/spark-7b8e1681-8a62-4f1d-9e11-fdf8062b1b08/8167072821453863280472_cache 
> to /./aws-java-sdk-1.7.4.jar
> 16/01/27 02:55:12 INFO Executor: Adding file:/./aws-java-sdk-1.7.4.jar to 
> class loader
>
> On Tue, Jan 26, 2016 at 5:40 PM, Jerry Lam  wrote:
>
>> Hi Mao,
>>
>> Can you try --jars to include those jars?
>>
>> Best Regards,
>>
>> Jerry
>>
>> Sent from my iPhone
>>
>> On 26 Jan, 2016, at 7:02 pm, Mao Geng  wrote:
>>
>> Hi there,
>>
>> I am trying to run Spark on Mesos using a Docker image as executor, as
>> mentioned
>> http://spark.apache.org/docs/latest/running-on-mesos.html#mesos-docker-support
>> .
>>
>> I built a docker image using the following Dockerfile (which is based on
>> https://github.com/apache/spark/blob/master/docker/spark-mesos/Dockerfile
>> ):
>>
>> FROM mesosphere/mesos:0.25.0-0.2.70.ubuntu1404
>>
>> # Update the base ubuntu image with dependencies needed for Spark
>> RUN apt-get update && \
>> apt-get install -y python libnss3 openjdk-7-jre-headless curl
>>
>> RUN curl
>> http://www.carfab.com/apachesoftware/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
>> | tar -xzC /opt && \
>> ln -s /opt/spark-1.6.0-bin-hadoop2.6 /opt/spark
>> ENV SPARK_HOME /opt/spark
>> ENV MESOS_NATIVE_JAVA_LIBRARY /usr/local/lib/libmesos.so
>>
>> Then I successfully ran spark-shell via this docker command:
>> docker run --rm -it --net=host /:
>> /opt/spark/bin/spark-shell --master mesos://:5050 --conf
>> /:
>>
>> So far so good. Then I wanted to call sc.textFile to load a file from S3,
>> but I was blocked by some issues which I couldn't figure out. I've read
>> https://dzone.com/articles/uniting-spark-parquet-and-s3-as-an-alternative-to
>> and
>> http://blog.encomiabile.it/2015/10/29/apache-spark-amazon-s3-and-apache-mesos,
>> learned that I need to add hadood-aws-2.7.1 and aws-java-sdk-2.7.4 into the
>> executor and driver's classpaths, in order to access s3 files.
>>
>> So, I added following lines into Dockerfile and build a new image.
>> RUN curl
>> https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
>> -o /opt/spark/lib/aws-java-sdk-1.7.4.jar
>> RUN curl
>> http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.1/hadoop-aws-2.7.1.jar
>> -o /opt/spark/lib/hadoop-aws-2.7.1.jar
>>
>> Then I started spark-shell again with below command:
>> docker run --rm -it --net=host /:
>> /opt/spark/bin/spark-shell --master mesos://:5050 --conf
>> /: --conf 
>> spark.executor.extraClassPath=/opt/spark/lib/hadoop-aws-2.7.1.jar:/opt/spark/lib/aws-java-sdk-1.7.4.jar
>>  --conf
>> spark.driver.extraClassPath=/opt/spark/lib/hadoop-aws-2.7.1.jar:/opt/spark/lib/aws-java-sdk-1.7.4.jar
>>
>> But below command failed when I ran it in spark-shell:
>> scala> sc.textFile("s3a:///").count()
>> [Stage 0:>  (0 +
>> 2) / 2]16/01/26 23:05:23 WARN TaskSetManager: Lost task 0.0 in stage 0.0
>> (TID 0, ip-172-16-14-203.us-west-2.compute.internal):
>> java.lang.RuntimeException: 

Re: Docker/Mesos with Spark

2016-01-19 Thread Sathish Kumaran Vairavelu
Hi Tim

Do you have any materials/blog posts on running Spark in a container in a Mesos
cluster environment? I have googled it but couldn't find info on it. The Spark
documentation says it is possible, but no details are provided. Please help.


Thanks

Sathish



On Mon, Sep 21, 2015 at 11:54 AM Tim Chen  wrote:

> Hi John,
>
> There is no other blog post yet, I'm thinking to do a series of posts but
> so far haven't get time to do that yet.
>
> Running Spark in docker containers makes distributing spark versions easy,
> it's simple to upgrade and automatically caches on the slaves so the same
> image just runs right away. Most of the docker perf is usually related to
> network and filesystem overheads, but I think with recent changes in Spark
> to make Mesos sandbox the default temp dir filesystem won't be a big
> concern as it's mostly writing to the mounted in Mesos sandbox. Also Mesos
> uses host network by default so network is affected much.
>
> Most of the cluster mode limitation is that you need to make the spark job
> files available somewhere that all the slaves can access remotely (http,
> s3, hdfs, etc) or available on all slaves locally by path.
>
> I'll try to make more doc efforts once I get my existing patches and
> testing infra work done.
>
> Let me know if you have more questions,
>
> Tim
>
> On Sat, Sep 19, 2015 at 5:42 AM, John Omernik  wrote:
>
>> I was searching in the 1.5.0 docs on the Docker on Mesos capabilities and
>> just found you CAN run it this way.  Are there any user posts, blog posts,
>> etc on why and how you'd do this?
>>
>> Basically, at first I was questioning why you'd run spark in a docker
>> container, i.e., if you run with tar balled executor, what are you really
>> gaining?  And in this setup, are you losing out on performance somehow? (I
>> am guessing smarter people than I have figured that out).
>>
>> Then I came along a situation where I wanted to use a python library with
>> spark, and it had to be installed on every node, and I realized one big
>> advantage of dockerized spark would be that spark apps that needed other
>> libraries could be contained and built well.
>>
>> OK, that's huge, let's do that.  For my next question there are lot of
>> "questions" have on how this actually works.  Does Clustermode/client mode
>> apply here? If so, how?  Is there a good walk through on getting this
>> setup? Limitations? Gotchas?  Should I just dive in an start working with
>> it? Has anyone done any stories/rough documentation? This seems like a
>> really helpful feature to scaling out spark, and letting developers truly
>> build what they need without tons of admin overhead, so I really want to
>> explore.
>>
>> Thanks!
>>
>> John
>>
>
>


Re: Docker/Mesos with Spark

2016-01-19 Thread Sathish Kumaran Vairavelu
Thank you! Looking forward to it.


On Tue, Jan 19, 2016 at 4:03 PM Tim Chen <t...@mesosphere.io> wrote:

> Hi Sathish,
>
> Sorry about that, I think that's a good idea and I'll write up a section
> in the Spark documentation page to explain how it can work. We (Mesosphere)
> have been doing this for our DCOS spark for our past releases and has been
> working well so far.
>
> Thanks!
>
> Tim
>
> On Tue, Jan 19, 2016 at 12:28 PM, Sathish Kumaran Vairavelu <
> vsathishkuma...@gmail.com> wrote:
>
>> Hi Tim
>>
>> Do you have any materials/blog for running Spark in a container in Mesos
>> cluster environment? I have googled it but couldn't find info on it. Spark
>> documentation says it is possible, but no details provided.. Please help
>>
>>
>> Thanks
>>
>> Sathish
>>
>>
>>
>>
>> On Mon, Sep 21, 2015 at 11:54 AM Tim Chen <t...@mesosphere.io> wrote:
>>
>>> Hi John,
>>>
>>> There is no other blog post yet, I'm thinking to do a series of posts
>>> but so far haven't get time to do that yet.
>>>
>>> Running Spark in docker containers makes distributing spark versions
>>> easy, it's simple to upgrade and automatically caches on the slaves so the
>>> same image just runs right away. Most of the docker perf is usually related
>>> to network and filesystem overheads, but I think with recent changes in
>>> Spark to make Mesos sandbox the default temp dir filesystem won't be a big
>>> concern as it's mostly writing to the mounted in Mesos sandbox. Also Mesos
>>> uses host network by default so network is affected much.
>>>
>>> Most of the cluster mode limitation is that you need to make the spark
>>> job files available somewhere that all the slaves can access remotely
>>> (http, s3, hdfs, etc) or available on all slaves locally by path.
>>>
>>> I'll try to make more doc efforts once I get my existing patches and
>>> testing infra work done.
>>>
>>> Let me know if you have more questions,
>>>
>>> Tim
>>>
>>> On Sat, Sep 19, 2015 at 5:42 AM, John Omernik <j...@omernik.com> wrote:
>>>
>>>> I was searching in the 1.5.0 docs on the Docker on Mesos capabilities
>>>> and just found you CAN run it this way.  Are there any user posts, blog
>>>> posts, etc on why and how you'd do this?
>>>>
>>>> Basically, at first I was questioning why you'd run spark in a docker
>>>> container, i.e., if you run with tar balled executor, what are you really
>>>> gaining?  And in this setup, are you losing out on performance somehow? (I
>>>> am guessing smarter people than I have figured that out).
>>>>
>>>> Then I came along a situation where I wanted to use a python library
>>>> with spark, and it had to be installed on every node, and I realized one
>>>> big advantage of dockerized spark would be that spark apps that needed
>>>> other libraries could be contained and built well.
>>>>
>>>> OK, that's huge, let's do that.  For my next question there are lot of
>>>> "questions" have on how this actually works.  Does Clustermode/client mode
>>>> apply here? If so, how?  Is there a good walk through on getting this
>>>> setup? Limitations? Gotchas?  Should I just dive in an start working with
>>>> it? Has anyone done any stories/rough documentation? This seems like a
>>>> really helpful feature to scaling out spark, and letting developers truly
>>>> build what they need without tons of admin overhead, so I really want to
>>>> explore.
>>>>
>>>> Thanks!
>>>>
>>>> John
>>>>
>>>
>>>
>


Re: Can a tempTable registered by sqlContext be used inside a forEachRDD?

2016-01-03 Thread Sathish Kumaran Vairavelu
I think you can use foreachPartition instead of foreachRDD.


Sathish
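
A minimal sketch of how the two can be combined (assumptions: sc is the SparkContext, messages is the Kafka DStream from the question, the parquet path is a placeholder, and the read API is Spark 1.4+). foreachRDD runs its body on the driver, so the SQLContext and the registered temp table can be queried there; the per-record work then goes inside foreachPartition on the executors.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
sqlContext.read.parquet("/path/in/hdfs").registerTempTable("tempTable")

messages.foreachRDD { rdd =>
  val message: RDD[String] = rdd.map(_._2)

  // Driver-side query against the temp table; collect only if the result is small.
  val lookup = sqlContext.sql("SELECT time, To FROM tempTable").collect()
  val lookupBc = rdd.context.broadcast(lookup)

  // Executor-side, per-partition processing of the streaming records.
  message.foreachPartition { partition =>
    partition.foreach { record =>
      // placeholder: combine the record with lookupBc.value and write it out
      println(record + " / " + lookupBc.value.length)
    }
  }
}
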
On Sun, Jan 3, 2016 at 5:51 AM SRK  wrote:

> Hi,
>
> Can a tempTable registered in sqlContext be used to query inside forEachRDD
> as shown below?
> My requirement is that I have a set of data in the form of parquet inside
> hdfs and I need to register the data
> as a tempTable using sqlContext and query it inside forEachRDD as shown
> below.
>
>   sqlContext.registerTempTable("tempTable")
>
> messages.foreachRDD { rdd =>
>   val message:RDD[String] = rdd.map { y => y._2 }
>
>   sqlContext.sql("SELECT time,To FROM tempTable")
> }
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Can-a-tempTable-registered-by-sqlContext-be-used-inside-a-forEachRDD-tp25862.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to return a pair RDD from an RDD that has foreachPartition applied?

2015-11-18 Thread Sathish Kumaran Vairavelu
I think you can use mapPartitions, which can return a pair RDD, followed by
foreachPartition for saving it.
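
A simplified sketch of that shape (the record types from the question are reduced to String/Long here, and a println stands in for the DataLoaderImpl save):

import org.apache.spark.rdd.RDD

def saveAndReturn(records: RDD[(String, Long)]): RDD[(String, Float)] = {
  // mapPartitions does the per-partition transformation and still returns a pair RDD
  val pairs: RDD[(String, Float)] = records.mapPartitions { partition =>
    partition.map { case (metric, count) => (metric, count.toFloat) }
  }
  pairs.cache()  // used twice: once for the save below, once by the caller

  // foreachPartition performs the side-effecting save
  pairs.foreachPartition { partition =>
    partition.foreach { case (metric, value) => println(s"saving $metric=$value") }
  }

  pairs  // still a pair RDD, so the caller can keep working with it
}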

On Wed, Nov 18, 2015 at 9:31 AM swetha kasireddy 
wrote:

> Looks like I can use mapPartitions but can it be done using
> forEachPartition?
>
> On Tue, Nov 17, 2015 at 11:51 PM, swetha 
> wrote:
>
>> Hi,
>>
>> How to return an RDD of key/value pairs from an RDD that has
>> foreachPartition applied. I have my code something like the following. It
>> looks like an RDD that has foreachPartition can have only the return type
>> as
>> Unit. How do I apply foreachPartition and do a save and at the same
>> return a
>> pair RDD.
>>
>>  def saveDataPointsBatchNew(records: RDD[(String, (Long,
>> java.util.LinkedHashMap[java.lang.Long,
>> java.lang.Float],java.util.LinkedHashMap[java.lang.Long, java.lang.Float],
>> java.util.HashSet[java.lang.String] , Boolean))])= {
>> records.foreachPartition({ partitionOfRecords =>
>>   val dataLoader = new DataLoaderImpl();
>>   var metricList = new java.util.ArrayList[String]();
>>   var storageTimeStamp = 0l
>>
>>   if (partitionOfRecords != null) {
>> partitionOfRecords.foreach(record => {
>>
>> if (record._2._1 == 0l) {
>> entrySet = record._2._3.entrySet()
>> itr = entrySet.iterator();
>> while (itr.hasNext()) {
>> val entry = itr.next();
>> storageTimeStamp = entry.getKey.toLong
>> val dayCounts = entry.getValue
>> metricsDayCounts += record._1 ->(storageTimeStamp,
>> dayCounts.toFloat)
>> }
>> }
>>}
>> }
>> )
>>   }
>>
>>   //Code to insert the last successful batch/streaming timestamp  ends
>>   dataLoader.saveDataPoints(metricList);
>>   metricList = null
>>
>> })
>>   }
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-return-a-pair-RDD-from-an-RDD-that-has-foreachPartition-applied-tp25411.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: JDBC thrift server

2015-10-08 Thread Sathish Kumaran Vairavelu
Which version of Spark are you using? You might encounter SPARK-6882
(https://issues.apache.org/jira/browse/SPARK-6882) if Kerberos is enabled.

-Sathish

On Thu, Oct 8, 2015 at 10:46 AM Younes Naguib <
younes.nag...@tritondigital.com> wrote:

> Hi,
>
>
>
> We’ve been using the JDBC thrift server for a couple of weeks now and
> running queries on it like a regular RDBMS.
>
> We’re about to deploy it in a shared production cluster.
>
>
>
> Any advice, warning on a such setup. Yarn or Mesos?
>
> How about dynamic resource allocation in a already running thrift server?
>
>
>
> *Thanks,*
>
> *Younes*
>
>
>


Reading Hive Tables using SQLContext

2015-09-24 Thread Sathish Kumaran Vairavelu
Hello,

Is it possible to access Hive tables directly from SQLContext instead of
HiveContext? I am facing with errors while doing it.

Please let me know


Thanks

Sathish


Re: Reading Hive Tables using SQLContext

2015-09-24 Thread Sathish Kumaran Vairavelu
Thanks Michael. Just want to check if there is a roadmap to include Hive
tables from SQLContext.

-Sathish
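
For reference, a minimal HiveContext sketch (the table name is a placeholder and sc is the existing SparkContext):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val df = hiveContext.sql("SELECT * FROM my_db.my_table")
df.show()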

On Thu, Sep 24, 2015 at 7:46 PM Michael Armbrust <mich...@databricks.com>
wrote:

> No, you have to use a HiveContext.
>
> On Thu, Sep 24, 2015 at 2:47 PM, Sathish Kumaran Vairavelu <
> vsathishkuma...@gmail.com> wrote:
>
>> Hello,
>>
>> Is it possible to access Hive tables directly from SQLContext instead of
>> HiveContext? I am facing with errors while doing it.
>>
>> Please let me know
>>
>>
>> Thanks
>>
>> Sathish
>>
>
>


Re: Best way to import data from Oracle to Spark?

2015-09-10 Thread Sathish Kumaran Vairavelu
I guess a Data Pump export from Oracle could be a fast option. Hive now has an
Oracle Data Pump SerDe:

https://docs.oracle.com/cd/E57371_01/doc.41/e57351/copy2bda.htm
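
For the JDBC route mentioned below, a rough sketch against the Spark 1.4 reader API (sqlContext is an existing SQLContext; the connection details, table and partitioning column are placeholders, and the Oracle JDBC driver jar has to be shipped to the executors, e.g. with --jars):

val ordersDF = sqlContext.read.format("jdbc").options(Map(
  "url"             -> "jdbc:oracle:thin:@//dbhost:1521/ORCL",
  "driver"          -> "oracle.jdbc.OracleDriver",
  "dbtable"         -> "SALES.ORDERS",
  "partitionColumn" -> "ORDER_ID",   // numeric column used to split the read
  "lowerBound"      -> "1",
  "upperBound"      -> "1000000",
  "numPartitions"   -> "8")).load()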


On Wed, Sep 9, 2015 at 4:41 AM Reynold Xin  wrote:

> Using the JDBC data source is probably the best way.
> http://spark.apache.org/docs/1.4.1/sql-programming-guide.html#jdbc-to-other-databases
>
> On Tue, Sep 8, 2015 at 10:11 AM, Cui Lin  wrote:
>
>> What's the best way to import data from Oracle to Spark? Thanks!
>>
>>
>> --
>> Best regards!
>>
>> Lin,Cui
>>
>
>


Re: How to set environment of worker applications

2015-08-23 Thread Sathish Kumaran Vairavelu
spark-env.sh works for me in Spark 1.4 but not
spark.executor.extraJavaOptions.
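
Not from this thread, but another option that may be worth trying is the spark.executorEnv.* configuration keys, which Spark forwards to the executor processes. A minimal sketch (MY_VAR is a hypothetical variable name):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("env-example")
  // copy the driver-side environment variable into every executor's environment
  .set("spark.executorEnv.MY_VAR", sys.env.getOrElse("MY_VAR", "default-value"))
val sc = new SparkContext(conf)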

On Sun, Aug 23, 2015 at 11:27 AM Raghavendra Pandey 
raghavendra.pan...@gmail.com wrote:

 I think the only way to pass environment variables to the worker nodes is to
 write them in the spark-env.sh file on each worker node.

 On Sun, Aug 23, 2015 at 8:16 PM, Hemant Bhanawat hemant9...@gmail.com
 wrote:

 Check for spark.driver.extraJavaOptions and
 spark.executor.extraJavaOptions in the following article. I think you can
 use -D to pass system vars:

 spark.apache.org/docs/latest/configuration.html#runtime-environment
 Hi,

 I am starting a spark streaming job in standalone mode with spark-submit.

 Is there a way to make the UNIX environment variables with which
 spark-submit is started available to the processes started on the worker
 nodes?

 Jan
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: How do we control output part files created by Spark job?

2015-07-06 Thread Sathish Kumaran Vairavelu
Try the coalesce function to limit the number of part files.
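
A minimal sketch of that suggestion with the Spark 1.4 DataFrame API (paths and the partition count are placeholders; dataFrame is the one from the question):

// coalesce before writing so the job emits a fixed, small number of part files
dataFrame.coalesce(8).write.format("orc").save("/path/in/hdfs")
dataFrame.coalesce(8).write.parquet("/path/in/parquet")
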
On Mon, Jul 6, 2015 at 1:23 PM kachau umesh.ka...@gmail.com wrote:

 Hi, I have a couple of Spark jobs which process thousands of files every
 day. File sizes vary from MBs to GBs. After finishing a job I usually save
 using the following code:

 finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
 dataFrame.write.format("orc").save("/path/in/hdfs") // storing as ORC file
 as of Spark 1.4

 The Spark job creates plenty of small part files in the final output directory. As
 far as I understand, Spark creates a part file for each partition/task; please
 correct me if I am wrong. How do we control the number of part files Spark
 creates? Finally, I would like to create a Hive table using these parquet/orc
 directories, and I have heard Hive is slow when we have a large number of small files.
 Please guide me; I am new to Spark. Thanks in advance.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-do-we-control-output-part-files-created-by-Spark-job-tp23649.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException in spark with mysql database

2015-07-06 Thread Sathish Kumaran Vairavelu
Try including an alias in the query:

val query = "(select * from " + table + ") a"
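
The reason the alias matters: the "dbtable" option is spliced into a schema probe of the form SELECT * FROM <dbtable> WHERE 1=0 (visible in the error below), so it has to be a valid table expression. A minimal sketch with the rest of the options from the question:

options.put("dbtable", "(select * from " + table + ") a")  // aliased subquery
val df = sqlContext.load("jdbc", options)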

On Mon, Jul 6, 2015 at 3:38 AM Hafiz Mujadid hafizmujadi...@gmail.com
wrote:

 Hi!
 I am trying to load data from my MySQL database using the following code:

 val query = "select * from  " + table + "  "
 val url = "jdbc:mysql://" + dataBaseHost + ":" + dataBasePort + "/" +
 dataBaseName + "?user=" + db_user + "&password=" + db_pass
 val sc = new SparkContext(new
 SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"))
 val sqlContext = new SQLContext(sc)
 val options = new HashMap[String, String]()
 options.put("driver", "com.mysql.jdbc.Driver")
 options.put("url", url)
 options.put("dbtable", query)
 options.put("numPartitions", "1")
 sqlContext.load("jdbc", options)

 And I get the following exception:

 Exception in thread main
 com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an
 error
 in your SQL syntax; check the manual that corresponds to your MySQL server
 version for the right syntax to use near 'select * from  tempTable   WHERE
 1=0'



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/com-mysql-jdbc-exceptions-jdbc4-MySQLSyntaxErrorException-in-spark-with-mysql-database-tp23643.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark SQL JDBC Source data skew

2015-06-25 Thread Sathish Kumaran Vairavelu
Can someone help me here, please?
On Sat, Jun 20, 2015 at 9:54 AM Sathish Kumaran Vairavelu 
vsathishkuma...@gmail.com wrote:

 Hi,

 In the Spark SQL JDBC data source there is an option to specify the upper/lower
 bound and the number of partitions. How does Spark handle data distribution if we do
 not give the upper/lower bounds or the number of partitions? Will all the data from the
 external data source end up skewed onto one executor?

 In many situations, we do not know the upper/lower bound of the underlying
 dataset until the query is executed, so it is not possible to pass
 upper/lower bound values.


 Thanks

 Sathish



Spark SQL JDBC Source data skew

2015-06-20 Thread Sathish Kumaran Vairavelu
Hi,

In the Spark SQL JDBC data source there is an option to specify the upper/lower
bound and the number of partitions. How does Spark handle data distribution if we do
not give the upper/lower bounds or the number of partitions? Will all the data from the
external data source end up skewed onto one executor?

In many situations, we do not know the upper/lower bound of the underlying
dataset until the query is executed, so it is not possible to pass
upper/lower bound values.


Thanks

Sathish
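
One workaround (not from this thread) is to fetch the bounds with a plain, unpartitioned JDBC read first and then pass them to the partitioned read. A rough sketch against the Spark 1.3 API; the url, table and column names are placeholders, and getLong assumes the column maps to a long:

val url = "jdbc:postgresql://dbhost/dbname"   // placeholder connection string

val bounds = sqlContext.load("jdbc", Map(
  "url"     -> url,
  "dbtable" -> "(select min(id) lo, max(id) hi from big_table) b")).collect()(0)

val df = sqlContext.jdbc(url, "big_table", "id",
  bounds.getLong(0), bounds.getLong(1), 12)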


Re: lowerupperBound not working/spark 1.3

2015-06-14 Thread Sathish Kumaran Vairavelu
Hi

I am also facing the same issue. Is it possible to view the actual query
passed to the database? Has anyone tried that? Also, what if we don't give
the upper and lower partition bounds? Would we end up with data skew?

Thanks

Sathish
On Sun, Jun 14, 2015 at 5:02 AM Sujeevan suje...@gmail.com wrote:

 I also thought it was an issue. After investigating it further, I found
 this: https://issues.apache.org/jira/browse/SPARK-6800

 Here is the updated documentation of
 *org.apache.spark.sql.jdbc.JDBCRelation#columnPartition* method


 Notice that lowerBound and upperBound are just used to decide the
 partition stride, not for filtering the rows in table. So all rows in the
 table will be partitioned and returned.

 So filter has to be added manually in the query.

 val jdbcDF = sqlContext.jdbc(url =
 jdbc:postgresql://localhost:5430/dbname?user=userpassword=111, table =
 (select * from se_staging.exp_table3 where cs_id = 1 and cs_id = 1)
 as exp_table3new ,columnName=cs_id,lowerBound=1 ,upperBound = 1,
 numPartitions=12 )




 Best Regards,

 Sujeevan. N

 On Mon, Mar 23, 2015 at 4:02 AM, Ted Yu yuzhih...@gmail.com wrote:

 I went over JDBCRelation#columnPartition() but didn't find obvious clue
 (you can add more logging to confirm that the partitions were generated
 correctly).

 Looks like the issue may be somewhere else.

 Cheers

 On Sun, Mar 22, 2015 at 12:47 PM, Marek Wiewiorka 
 marek.wiewio...@gmail.com wrote:

 ...I even tried setting upper/lower bounds to the same value like 1 or
 10 with the same result.
 cs_id is a column of the cardinality ~5*10^6
 So this is not the case here.

 Regards,
 Marek

 2015-03-22 20:30 GMT+01:00 Ted Yu yuzhih...@gmail.com:

 From javadoc of JDBCRelation#columnPartition():
* Given a partitioning schematic (a column of integral type, a
 number of
* partitions, and upper and lower bounds on the column's value),
 generate

 In your example, 1 and 1 are for the value of cs_id column.

 Looks like all the values in that column fall within the range of 1 and
 1000.

 Cheers

 On Sun, Mar 22, 2015 at 8:44 AM, Marek Wiewiorka 
 marek.wiewio...@gmail.com wrote:

 Hi All - I try to use the new SQLContext API for populating DataFrame
 from jdbc data source.
 like this:

 val jdbcDF = sqlContext.jdbc(url =
 jdbc:postgresql://localhost:5430/dbname?user=userpassword=111, table =
 se_staging.exp_table3 ,columnName=cs_id,lowerBound=1 ,upperBound =
 1, numPartitions=12 )

 No matter how I set lower and upper bounds I always get all the rows
 from my table.
 The API is marked as experimental so I assume there might by some bugs
 in it but
 did anybody come across a similar issue?

 Thanks!








Spark SQL JDBC Source Join Error

2015-06-14 Thread Sathish Kumaran Vairavelu
Hello Everyone,

I pulled 2 different tables from the JDBC source and then joined them using
the cust_id *decimal* column, with a simple join like the one below. This simple join
works perfectly in the database but not in Spark SQL. I am importing the 2
tables as data frames, registering them as temp tables, and firing SQL on top of them.
Please let me know what the error could be.

select b.customer_type, sum(a.amount) total_amount from
customer_activity a,
account b
where
a.cust_id = b.cust_id
group by b.customer_type

CastException: java.math.BigDecimal cannot be cast to
org.apache.spark.sql.types.Decimal

at
org.apache.spark.sql.types.Decimal$DecimalIsFractional$.plus(Decimal.scala:330)

at
org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:127)

at
org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:50)

at
org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:83)

at
org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:571)

at
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:163)

at
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:147)

at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)

at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)

at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

at org.apache.spark.scheduler.Task.run(Task.scala:64)

at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)


Re: Spark SQL JDBC Source Join Error

2015-06-14 Thread Sathish Kumaran Vairavelu
Thank you.. it works in Spark 1.4.

On Sun, Jun 14, 2015 at 3:51 PM Michael Armbrust mich...@databricks.com
wrote:

 Sounds like SPARK-5456 https://issues.apache.org/jira/browse/SPARK-5456.
 Which is fixed in Spark 1.4.

 On Sun, Jun 14, 2015 at 11:57 AM, Sathish Kumaran Vairavelu 
 vsathishkuma...@gmail.com wrote:

 Hello Everyone,

 I pulled 2 different tables from the JDBC source and then joined them
 using the cust_id *decimal* column. A simple join like as below. This
 simple join works perfectly in the database but not in Spark SQL. I am
 importing 2 tables as a data frame/registertemptable and firing sql on top
 of it. Please let me know what could be the error..

 select b.customer_type, sum(a.amount) total_amount from
 customer_activity a,
 account b
 where
 a.cust_id = b.cust_id
 group by b.customer_type

 CastException: java.math.BigDecimal cannot be cast to
 org.apache.spark.sql.types.Decimal

 at
 org.apache.spark.sql.types.Decimal$DecimalIsFractional$.plus(Decimal.scala:330)

 at
 org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:127)

 at
 org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:50)

 at
 org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:83)

 at
 org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:571)

 at
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:163)

 at
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:147)

 at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)

 at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)

 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)

 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)

 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

 at org.apache.spark.scheduler.Task.run(Task.scala:64)

 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)

 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

 at java.lang.Thread.run(Thread.java:745)





Spark SQL - Complex query pushdown

2015-06-14 Thread Sathish Kumaran Vairavelu
Hello,

Is there a way in Spark to define a data source (say a JDBC
source) once and then define the list of tables to be used on that data source? Much like
a JDBC connection, where we define the connection and then execute statements
against that connection. In the current external table implementation, each
table requires the complete data source information (URL, etc.).

The use case is something like this: I have n tables on database1 and m tables on
database2, and there is a complex query that combines both the m and n tables. Can
Spark SQL decompose the complex query into data-source-specific queries, say a
source 1 query (with the n tables) executed on database-1 and a source 2 query
(with the m tables) executed on database-2, with the source 1 and source 2 query
results joined/merged in the Spark layer to produce the final output? Will
pushdown optimization work at the data source level or at the individual
external table level?


Thanks

Sathish
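
For what it's worth, a rough sketch of how this looks with the JDBC source in Spark 1.3/1.4 (connection strings and table names are placeholders): each table is registered from its own source, column pruning and simple filters are pushed down to each database, but the cross-database join itself runs in the Spark layer.

val customers = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:postgresql://db1-host/sales", "dbtable" -> "customers"))
customers.registerTempTable("customers")

val invoices = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:mysql://db2-host/billing", "dbtable" -> "invoices"))
invoices.registerTempTable("invoices")

// the join and aggregation are executed by Spark, not by either database
val result = sqlContext.sql(
  """SELECT c.customer_type, SUM(i.amount) AS total_amount
    |FROM customers c JOIN invoices i ON c.cust_id = i.cust_id
    |GROUP BY c.customer_type""".stripMargin)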


Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-06-11 Thread Sathish Kumaran Vairavelu
Hi Nathan,

I am also facing the issue with Spark 1.3. Did you find any workaround for
this issue? Please help


Thanks

Sathish

On Thu, Apr 16, 2015 at 6:03 AM Nathan McCarthy 
nathan.mccar...@quantium.com.au wrote:

  Its JTDS 1.3.1; http://sourceforge.net/projects/jtds/files/jtds/1.3.1/

  I put that jar in /tmp on the driver/machine I’m running spark shell
 from.

  Then I ran with ./bin/spark-shell --jars /tmp/jtds-1.3.1.jar --master
 yarn-client

  So I’m guessing that --jars doesn’t set the class path for the
 primordial class loader. And because its on the class path in ‘user land’
 I’m guessing

  Thinking a work around would be to merge my spark assembly jar with the
 jtds driver… But it seems like a hack. The other thing I notice is there is
 --file which lets me pass around files with the YARN distribute, so Im
 thinking I can somehow use this if --jars doesn’t work.

  Really I need to understand how the spark class path is set when running
 on YARN.


   From: ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 Date: Thursday, 16 April 2015 3:02 pm
 To: Nathan nathan.mccar...@quantium.com.au
 Cc: user@spark.apache.org user@spark.apache.org
 Subject: Re: SparkSQL JDBC Datasources API when running on YARN - Spark
 1.3.0

   Can you provide the JDBC connector jar version. Possibly the full JAR
 name and full command you ran Spark with ?

 On Wed, Apr 15, 2015 at 11:27 AM, Nathan McCarthy 
 nathan.mccar...@quantium.com.au wrote:

  Just an update, tried with the old JdbcRDD and that worked fine.

   From: Nathan nathan.mccar...@quantium.com.au
 Date: Wednesday, 15 April 2015 1:57 pm
 To: user@spark.apache.org user@spark.apache.org
 Subject: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

   Hi guys,

  Trying to use a Spark SQL context’s .load(“jdbc, …) method to create a
 DF from a JDBC data source. All seems to work well locally (master =
 local[*]), however as soon as we try and run on YARN we have problems.

  We seem to be running into problems with the class path and loading up
 the JDBC driver. I’m using the jTDS 1.3.1 driver,
 net.sourceforge.jtds.jdbc.Driver.

  ./bin/spark-shell --jars /tmp/jtds-1.3.1.jar --master yarn-client

  When trying to run I get an exception;

  scala> sqlContext.load("jdbc", Map("url" ->
 "jdbc:jtds:sqlserver://blah:1433/MyDB;user=usr;password=pwd", "dbtable" ->
 "CUBE.DIM_SUPER_STORE_TBL"))

  java.sql.SQLException: No suitable driver found for
 jdbc:jtds:sqlserver://blah:1433/MyDB;user=usr;password=pwd

  Thinking maybe we need to force load the driver, if I supply *"driver"
 -> "net.sourceforge.jtds.jdbc.Driver"* to .load we get;

  scala> sqlContext.load("jdbc", Map("url" ->
 "jdbc:jtds:sqlserver://blah:1433/MyDB;user=usr;password=pwd", "driver" ->
 "net.sourceforge.jtds.jdbc.Driver", "dbtable" ->
 "CUBE.DIM_SUPER_STORE_TBL"))

   java.lang.ClassNotFoundException: net.sourceforge.jtds.jdbc.Driver
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:191)
 at
 org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:97)
 at
 org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:290)
 at org.apache.spark.sql.SQLContext.load(SQLContext.scala:679)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:21)

  Yet if I run a Class.forName() just from the shell;

  scala> Class.forName("net.sourceforge.jtds.jdbc.Driver")
 res1: Class[_] = class net.sourceforge.jtds.jdbc.Driver

  No problem finding the JAR. I’ve tried in both the shell, and running
 with spark-submit (packing the driver in with my application as a fat JAR).
 Nothing seems to work.

  I can also get a connection in the driver/shell no problem;

  scala> import java.sql.DriverManager
 import java.sql.DriverManager
  scala>
 DriverManager.getConnection("jdbc:jtds:sqlserver://blah:1433/MyDB;user=usr;password=pwd")
 res3: java.sql.Connection =
 net.sourceforge.jtds.jdbc.JtdsConnection@2a67ecd0

  I’m probably missing some class path setting here. In
 *jdbc.DefaultSource.createRelation* it looks like the call to
 *Class.forName* doesn’t specify a class loader so it just uses the
 default Java behaviour to reflectively get the class loader. It almost
 feels like its using a different class loader.

  I also tried seeing if the class path was there on all my executors by
 running;

  *import *scala.collection.JavaConverters._

 sc.parallelize(*Seq*(1,2,3,4)).flatMap(_ = java.sql.DriverManager.
 *getDrivers*().asScala.map(d = *s**”**$*d* | **$*{d.acceptsURL(
 

Drools in Spark

2015-04-07 Thread Sathish Kumaran Vairavelu
Hello,

Just want to check if anyone has tried Drools with Spark? Please let me
know. Are there any alternative rule engines that work well with Spark?


Thanks
Sathish


Error in SparkSQL/Scala IDE

2015-04-02 Thread Sathish Kumaran Vairavelu
Hi Everyone,

I am getting the following error while registering a table using the Scala IDE.
Please let me know how to resolve this error. I am using Spark 1.2.1.

  import sqlContext.createSchemaRDD

  val empFile = sc.textFile("/tmp/emp.csv", 4)
    .map(_.split(","))
    .map(row => Employee(row(0), row(1), row(2), row(3), row(4)))

  empFile.registerTempTable("Employees")

Thanks

Sathish

Exception in thread main scala.reflect.internal.MissingRequirementError:
class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with
primordial classloader with boot classpath
[/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-library.jar:/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-reflect.jar:/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-actor.jar:/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-swing.jar:/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-compiler.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/resources.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/rt.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/sunrsasign.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/jsse.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/jce.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/classes]
not found.

at scala.reflect.internal.MissingRequirementError$.signal(
MissingRequirementError.scala:16)

at scala.reflect.internal.MissingRequirementError$.notFound(
MissingRequirementError.scala:17)

at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(
Mirrors.scala:48)

at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(
Mirrors.scala:61)

at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(
Mirrors.scala:72)

at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)

at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)

at org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(
ScalaReflection.scala:115)

at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(
TypeTags.scala:231)

at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)

at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)

at scala.reflect.api.Universe.typeOf(Universe.scala:59)

at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(
ScalaReflection.scala:115)

at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(
ScalaReflection.scala:33)

at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(
ScalaReflection.scala:100)

at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(
ScalaReflection.scala:33)

at org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(
ScalaReflection.scala:94)

at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(
ScalaReflection.scala:33)

at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111)

at com.svairavelu.examples.QueryCSV$.main(QueryCSV.scala:24)

 at com.svairavelu.examples.QueryCSV.main(QueryCSV.scala)


Checking Data Integrity in Spark

2015-03-27 Thread Sathish Kumaran Vairavelu
Hello,

I want to check if there is any way to validate the integrity of data
files. The use case is to perform data integrity checks on large files with 100+
columns and reject records (writing them to another file) that do not meet
criteria such as NOT NULL, date format, etc. Since there are a lot of
columns/integrity rules, we should be able to drive the integrity checks through
configuration (XML, JSON, etc.). Please share your thoughts.


Thanks

Sathish
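
A small sketch of one way this could be driven from configuration (everything here is illustrative: the rules would be loaded from JSON/XML rather than hard-coded, and the column positions are made up):

import org.apache.spark.rdd.RDD

case class Rule(column: Int, description: String, check: String => Boolean)

val rules = Seq(
  Rule(0, "id NOT NULL",            v => v != null && v.nonEmpty),
  Rule(3, "date format yyyy-MM-dd", v => v.matches("""\d{4}-\d{2}-\d{2}"""))
)

def validate(lines: RDD[String]): (RDD[String], RDD[String]) = {
  def passes(line: String): Boolean = {
    val cols = line.split(",", -1)
    rules.forall(r => cols.length > r.column && r.check(cols(r.column)))
  }
  (lines.filter(passes), lines.filter(l => !passes(l)))  // (accepted, rejected)
}

// accepted records go to the main output, rejected ones to a separate path, e.g.
// val (good, bad) = validate(sc.textFile("/data/input"))
// bad.saveAsTextFile("/data/rejected")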


reduceByKey vs countByKey

2015-02-24 Thread Sathish Kumaran Vairavelu
Hello,

Quick question. I am trying to understand the difference between reduceByKey and
countByKey. Which one gives better performance, reduceByKey or countByKey?
Since we can perform the same count operation using reduceByKey, why do we need
countByKey/countByValue?

Thanks

Sathish
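
A small illustration of the difference (not a benchmark; sc is assumed to be a SparkContext): reduceByKey is a transformation that keeps the counts distributed as an RDD, while countByKey is an action that ships every per-key count back to the driver as a local Map, so it is only appropriate when the number of distinct keys is small.

import org.apache.spark.SparkContext._  // pair-RDD functions, needed before Spark 1.3
import org.apache.spark.rdd.RDD

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

val countsRdd: RDD[(String, Int)] = pairs.reduceByKey(_ + _)            // stays on the cluster
val countsMap: scala.collection.Map[String, Long] = pairs.countByKey()  // collected to the driver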


Re: Publishing streaming results to web interface

2015-01-02 Thread Sathish Kumaran Vairavelu
Try and see if this helps. http://zeppelin-project.org/

-Sathish


On Fri Jan 02 2015 at 8:20:54 PM Pankaj Narang pankajnaran...@gmail.com
wrote:

 Thomus,

 Spark does not provide any web interface directly. There might be third
 party apps providing dashboards
 but I am not aware of any for the same purpose.

 *You can use some methods so that the data is saved to the file system instead
 of being printed on screen.

 Some of the methods you can use on an RDD are saveAsObjectFile and saveAsTextFile.*


 Now you can read these files to show them on web interface in  any language
 of your choice

 Regards
 Pankaj






 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/Publishing-streaming-results-to-web-interface-
 tp20948p20949.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Spark SQL JSON dataset query nested datastructures

2014-08-09 Thread Sathish Kumaran Vairavelu
I have a simple JSON dataset as below. How do I query all parts.lock values
for id=1?

JSON: { "id": 1, "name": "A green door", "price": 12.50, "tags": ["home",
"green"], "parts": [ { "lock": "One lock", "key": "single key" }, {
"lock": "2 lock", "key": "2 key" } ] }

Query: select id, name, price, parts.lock from product where id=1

The point is, if I use parts[0].lock it will return 1 row as below:

{u'price': 12.5, u'id': 1, u'.lock': {u'lock': u'One lock', u'key':
u'single key'}, u'name': u'A green door'}

But I want to return all the locks in the parts structure. It will
return multiple rows, but that is what I am looking for. It is a kind of
relational join that I want to accomplish.

Please help me with this
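
A rough sketch of one way to get that one-row-per-lock behaviour (the file path is a placeholder; LATERAL VIEW explode needs a HiveContext, and jsonFile is the Spark 1.1-era loader):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.jsonFile("/tmp/product.json").registerTempTable("product")

// explode turns each element of the parts array into its own row
hiveContext.sql(
  """SELECT id, name, price, p.`lock`
    |FROM product LATERAL VIEW explode(parts) t AS p
    |WHERE id = 1""".stripMargin).collect()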


Spark SQL dialect

2014-08-08 Thread Sathish Kumaran Vairavelu
Hi,

Can anyone point me to where I can find the SQL dialect for Spark SQL? Unlike
HQL, there are a lot of tasks involved in creating and querying tables, which
is very cumbersome. If we have to fire multiple queries on 10's and
100's of tables, then it is very difficult at this point. Given that Spark SQL
will replace Shark, it would be awesome to have a nice way to handle DDL and
DML operations. We would use it for ETL and BI querying in a
seamless fashion.

Can anyone please help me with this?


Thanks

Sathish


Using Python IDE for Spark Application Development

2014-08-06 Thread Sathish Kumaran Vairavelu
Hello,

I am trying to use the Python IDE PyCharm for Spark application
development. How can I use PySpark with a Python IDE? Can anyone help me with
this?


Thanks

Sathish


Re: Using Python IDE for Spark Application Development

2014-08-06 Thread Sathish Kumaran Vairavelu
Mohit, this doesn't seem to be working; can you please provide more
details? When I use "from pyspark import SparkContext" it is disabled in
PyCharm. I use the PyCharm Community Edition. Where should I set the
environment variables: in the same Python script or a different Python script?

Also, should I run a local Spark cluster so the Spark program runs on top of
that?


Appreciate your help

-Sathish


On Wed, Aug 6, 2014 at 6:22 PM, Mohit Singh mohit1...@gmail.com wrote:

 My naive set up..
 Adding
 os.environ['SPARK_HOME'] = "/path/to/spark"
 sys.path.append("/path/to/spark/python")
 on top of my script.
 from pyspark import SparkContext
 from pyspark import SparkConf
 Execution works from within pycharm...

 Though my next step is to figure out autocompletion and I bet there are
 better ways to develop apps for spark..



 On Wed, Aug 6, 2014 at 4:16 PM, Sathish Kumaran Vairavelu 
 vsathishkuma...@gmail.com wrote:

 Hello,

 I am trying to use the python IDE PyCharm for Spark application
 development. How can I use pyspark with Python IDE? Can anyone help me with
 this?


 Thanks

 Sathish





 --
 Mohit

 When you want success as badly as you want the air, then you will get it.
 There is no other secret of success.
 -Socrates