Re: Exception handling in Spark

2020-05-05 Thread Todd Nist
Could you do something like this prior to calling the action?

import org.apache.hadoop.fs.{FileSystem, Path}

// Create a FileSystem object from the Hadoop configuration
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// exists returns a Boolean: true if the file exists, false if it doesn't
val fileExists = fs.exists(new Path("/tmp/broadcast.xml"))
if (fileExists) println("File exists!")
else println("File doesn't exist!")

Not sure that will help you or not, just a thought.

-Todd




On Tue, May 5, 2020 at 11:45 AM Mich Talebzadeh 
wrote:

> Thanks  Brandon!
>
> i should have remembered that.
>
> basically the code gets out with sys.exit(1)  if it cannot find the file
>
> I guess there is no easy way of validating DF except actioning it by
> show(1,0) etc and checking if it works?
>
> Regards,
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 5 May 2020 at 16:41, Brandon Geise  wrote:
>
>> You could use the Hadoop API and check if the file exists.
>>
>>
>>
>> *From: *Mich Talebzadeh 
>> *Date: *Tuesday, May 5, 2020 at 11:25 AM
>> *To: *"user @spark" 
>> *Subject: *Exception handling in Spark
>>
>>
>>
>> Hi,
>>
>>
>>
>> As I understand exception handling in Spark only makes sense if one
>> attempts an action as opposed to lazy transformations?
>>
>>
>>
>> Let us assume that I am reading an XML file from the HDFS directory  and
>> create a dataframe DF on it
>>
>>
>>
>> val broadcastValue = "123456789"  // I assume this will be sent as a
>> constant for the batch
>>
>> // Create a DF on top of XML
>> val df = spark.read.
>> format("com.databricks.spark.xml").
>> option("rootTag", "hierarchy").
>> option("rowTag", "sms_request").
>> load("/tmp/broadcast.xml")
>>
>> val newDF = df.withColumn("broadcastid", lit(broadcastValue))
>>
>> newDF.createOrReplaceTempView("tmp")
>>
>>   // Put data in Hive table
>>   //
>>   sqltext = """
>>   INSERT INTO TABLE michtest.BroadcastStaging PARTITION
>> (broadcastid="123456", brand)
>>   SELECT
>>   ocis_party_id AS partyId
>> , target_mobile_no AS phoneNumber
>> , brand
>> , broadcastid
>>   FROM tmp
>>   """
>> //
>>
>> // Here I am performing a collection
>>
>> try  {
>>
>>  spark.sql(sqltext)
>>
>> } catch {
>>
>> case e: SQLException => e.printStackTrace
>>
>> sys.exit()
>>
>> }
>>
>>
>>
>> Now the issue I have is that what if the xml file  /tmp/broadcast.xml
>> does not exist or deleted? I won't be able to catch the error until the
>> hive table is populated. Of course I can write a shell script to check if
>> the file exist before running the job or put small collection like
>> df.show(1,0). Are there more general alternatives?
>>
>>
>>
>> Thanks
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn  
>> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>


Re: Using P4J Plugins with Spark

2020-04-21 Thread Todd Nist
You may want to make sure you include the P4J jar and your plugins via one of the
options below so that both the driver and the executors have access to them. If HDFS
is out, you could make a common mount point on each of the executor nodes so they
have access to the classes.


   - spark-submit --jars /common/path/to/jars
   - spark.driver.extraClassPath, or its alias --driver-class-path, to set extra
     classpaths on the node running the driver.
   - spark.executor.extraClassPath to set extra classpaths on the worker nodes.


On Tue, Apr 21, 2020 at 1:13 AM Shashanka Balakuntala <
shbalakunt...@gmail.com> wrote:

> Hi users,
> I'm a bit of newbie to spark infrastructure. And i have a small doubt.
> I have a maven project with plugins generated separately in a folder and
> normal java command to run is as follows:
> `java -Dp4j.pluginsDir=./plugins -jar /path/to/jar`
>
> Now when I run this program in local with spark-submit with standalone
> cluster(not cluster mode) the program compiles and plugins are in "plugins"
> folder in the $SPARK_HOME and it is getting recognised.
> The same is not the case in cluster mode. It says the Extension point is
> not loaded. Please advise on how can i create a folder which can be shared
> among the workers in "plugin" folder.
>
> PS: HDFS is not an options as we dont have a different setup
>
> Thanks.
>
>
> *Regards*
>   Shashanka Balakuntala Srinivasa
>
>


Re: spark.submit.deployMode: cluster

2019-03-29 Thread Todd Nist
A little late, but have you looked at https://livy.incubator.apache.org/?
It works well for us.
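
If you would rather stay in-process than stand up a REST service, Spark's launcher API
is another option. A rough sketch; the jar path, class name and master URL below are
placeholders, not from this thread:

import org.apache.spark.launcher.SparkLauncher

val handle = new SparkLauncher()
  .setMaster("spark://master:7077")         // placeholder master URL
  .setDeployMode("cluster")                 // driver runs on the cluster, not in this JVM
  .setAppResource("/path/to/your-app.jar")  // placeholder application jar
  .setMainClass("com.example.YourApp")      // placeholder main class
  .startApplication()                       // returns a SparkAppHandle you can poll for state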

-Todd

On Thu, Mar 28, 2019 at 9:33 PM Jason Nerothin 
wrote:

> Meant this one: https://docs.databricks.com/api/latest/jobs.html
>
> On Thu, Mar 28, 2019 at 5:06 PM Pat Ferrel  wrote:
>
>> Thanks, are you referring to
>> https://github.com/spark-jobserver/spark-jobserver or the undocumented
>> REST job server included in Spark?
>>
>>
>> From: Jason Nerothin  
>> Reply: Jason Nerothin  
>> Date: March 28, 2019 at 2:53:05 PM
>> To: Pat Ferrel  
>> Cc: Felix Cheung  ,
>> Marcelo Vanzin  , user
>>  
>> Subject:  Re: spark.submit.deployMode: cluster
>>
>> Check out the Spark Jobs API... it sits behind a REST service...
>>
>>
>> On Thu, Mar 28, 2019 at 12:29 Pat Ferrel  wrote:
>>
>>> ;-)
>>>
>>> Great idea. Can you suggest a project?
>>>
>>> Apache PredictionIO uses spark-submit (very ugly) and Apache Mahout only
>>> launches trivially in test apps since most uses are as a lib.
>>>
>>>
>>> From: Felix Cheung 
>>> 
>>> Reply: Felix Cheung 
>>> 
>>> Date: March 28, 2019 at 9:42:31 AM
>>> To: Pat Ferrel  , Marcelo
>>> Vanzin  
>>> Cc: user  
>>> Subject:  Re: spark.submit.deployMode: cluster
>>>
>>> If anyone wants to improve docs please create a PR.
>>>
>>> lol
>>>
>>>
>>> But seriously you might want to explore other projects that manage job
>>> submission on top of spark instead of rolling your own with spark-submit.
>>>
>>>
>>> --
>>> *From:* Pat Ferrel 
>>> *Sent:* Tuesday, March 26, 2019 2:38 PM
>>> *To:* Marcelo Vanzin
>>> *Cc:* user
>>> *Subject:* Re: spark.submit.deployMode: cluster
>>>
>>> Ahh, thank you indeed!
>>>
>>> It would have saved us a lot of time if this had been documented. I
>>> know, OSS so contributions are welcome… I can also imagine your next
>>> comment; “If anyone wants to improve docs see the Apache contribution rules
>>> and create a PR.” or something like that.
>>>
>>> BTW the code where the context is known and can be used is what I’d call
>>> a Driver and since all code is copied to nodes and is know in jars, it was
>>> not obvious to us that this rule existed but it does make sense.
>>>
>>> We will need to refactor our code to use spark-submit it appears.
>>>
>>> Thanks again.
>>>
>>>
>>> From: Marcelo Vanzin  
>>> Reply: Marcelo Vanzin  
>>> Date: March 26, 2019 at 1:59:36 PM
>>> To: Pat Ferrel  
>>> Cc: user  
>>> Subject:  Re: spark.submit.deployMode: cluster
>>>
>>> If you're not using spark-submit, then that option does nothing.
>>>
>>> If by "context creation API" you mean "new SparkContext()" or an
>>> equivalent, then you're explicitly creating the driver inside your
>>> application.
>>>
>>> On Tue, Mar 26, 2019 at 1:56 PM Pat Ferrel 
>>> wrote:
>>> >
>>> > I have a server that starts a Spark job using the context creation
>>> API. It DOES NOY use spark-submit.
>>> >
>>> > I set spark.submit.deployMode = “cluster”
>>> >
>>> > In the GUI I see 2 workers with 2 executors. The link for running
>>> application “name” goes back to my server, the machine that launched the
>>> job.
>>> >
>>> > This is spark.submit.deployMode = “client” according to the docs. I
>>> set the Driver to run on the cluster but it runs on the client, ignoring
>>> the spark.submit.deployMode.
>>> >
>>> > Is this as expected? It is documented nowhere I can find.
>>> >
>>>
>>>
>>> --
>>> Marcelo
>>>
>>> --
>> Thanks,
>> Jason
>>
>>
>
> --
> Thanks,
> Jason
>


Re: cache table vs. parquet table performance

2019-01-16 Thread Todd Nist
Hi Tomas,

Have you considered using something like https://www.alluxio.org/ for your
cache?  It seems like a possible solution for what you're trying to do.
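
The usual pattern there, as I understand it, is to keep the hot set as Parquet on
Alluxio and read it through the alluxio:// scheme so repeated queries hit Alluxio
memory instead of HDFS. A rough sketch, assuming the Alluxio client jar is on the
Spark classpath; the host and path are placeholders:

val hot = spark.read.parquet("alluxio://alluxio-master:19998/events/jan02")
hot.createOrReplaceTempView("event_jan_01")
spark.sql("select count(*) from event_jan_01 where day_registered = 20190102").show()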

-Todd

On Tue, Jan 15, 2019 at 11:24 PM 大啊  wrote:

> Hi, Tomas.
> Thanks for your question, it gave me something to think about. But the best
> use of cache is usually for smaller data.
> I think caching large data will consume too much memory or disk space.
> Spilling the cached data in Parquet format may be a good improvement.
>
> At 2019-01-16 02:20:56, "Tomas Bartalos"  wrote:
>
> Hello,
>
> I'm using spark-thrift server and I'm searching for best performing
> solution to query hot set of data. I'm processing records with nested
> structure, containing subtypes and arrays. 1 record takes up several KB.
>
> I tried to make some improvement with cache table:
>
> cache table event_jan_01 as select * from events where day_registered =
> 20190102;
>
>
> If I understood correctly, the data should be stored in *in-memory
> columnar* format with storage level MEMORY_AND_DISK. So data which
> doesn't fit in memory will be spilled to disk (I assume also in columnar
> format (?))
> I cached 1 day of data (1 M records) and according to spark UI storage tab
> none of the data was cached to memory and everything was spilled to disk.
> The size of the data was *5.7 GB.*
> Typical queries took ~ 20 sec.
>
> Then I tried to store the data to parquet format:
>
> CREATE TABLE event_jan_01_par USING parquet location "/tmp/events/jan/02"
> as
>
> select * from event_jan_01;
>
>
> The whole parquet took up only *178MB.*
> And typical queries took 5-10 sec.
>
> Is it possible to tune spark to spill the cached data in parquet format ?
> Why the whole cached table was spilled to disk and nothing stayed in
> memory ?
>
> Spark version: 2.4.0
>
> Best regards,
> Tomas
>
>
>
>
>


Re: Backpressure initial rate not working

2018-07-26 Thread Todd Nist
Have you tried reducing the maxRatePerPartition to a lower value?  Based on
your settings, I believe you are going to be able to pull *600K* worth of
messages from Kafka, basically:

  • maxRatePerPartition = 15000
  • batchInterval = 10s
  • 4 partitions on the ingest topic

This results in a maximum ingest rate of 600K:

  • 4 * 10 * 15000 = 600,000 max

Can you reduce the maxRatePerPartition to, say, 1500 for a test run?  That
should result in a more manageable batch and you can adjust from there.

  • 4 * 10 * 1500 = 60,000 max

I know we are not setting the maxRate or initialRate, only the
maxRatePerPartition and backpressure.enabled.  I thought that maxRate was
not applicable when using back pressure, but I may be mistaken.
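
For reference, a minimal sketch of setting those values programmatically on the
SparkConf instead of via --conf flags; the numbers are just the ones from the
formula above, adjust for your load:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.backpressure.initialRate", "2000")
  // 4 partitions * 10s batch * 1500 = 60,000 records max per batch
  .set("spark.streaming.kafka.maxRatePerPartition", "1500")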


-Todd






On Thu, Jul 26, 2018 at 8:46 AM Biplob Biswas 
wrote:

> Hi Todd,
>
> Thanks for the reply. I have the maxRatePerPartition set as well. Below
> is the spark submit config we used and still got the issue. Also the *batch
> interval is set at 10s* and *number of partitions on the topic is set to
> 4*  :
>
> spark2-submit --name "${YARN_NAME}" \
>--master yarn \
>--deploy-mode ${DEPLOY_MODE} \
>--num-executors ${NUM_EXECUTORS} \
>--driver-cores ${NUM_DRIVER_CORES} \
>--executor-cores ${NUM_EXECUTOR_CORES} \
>--driver-memory ${DRIVER_MEMORY} \
>--executor-memory ${EXECUTOR_MEMORY} \
>--queue ${YARN_QUEUE} \
>--keytab ${KEYTAB}-yarn \
>--principal ${PRINCIPAL} \
>--conf "spark.yarn.preserve.staging.files=true" \
>--conf "spark.yarn.submit.waitAppCompletion=false" \
>--conf "spark.shuffle.service.enabled=true" \
>--conf "spark.dynamicAllocation.enabled=true" \
>--conf "spark.dynamicAllocation.minExecutors=1" \
>--conf "spark.streaming.backpressure.enabled=true" \
>--conf "spark.streaming.receiver.maxRate=15000" \
>--conf "spark.streaming.kafka.maxRatePerPartition=15000" \
>--conf "spark.streaming.backpressure.initialRate=2000" \
>--conf 
> "spark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/lib/"
>  \
>--driver-class-path 
> "/opt/cloudera/parcels/CDH/lib/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/lib/"
>  \
>--driver-java-options "-Djava.security.auth.login.config=./jaas.conf 
> -Dlog4j.configuration=log4j-spark.properties" \
>--conf 
> "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf
>  -Dlog4j.configuration=log4j-spark.properties" \
>--files "${JAAS_CONF},${KEYTAB}" \
>--class "${MAIN_CLASS}" \
>"${ARTIFACT_FILE}"
>
>
> The first batch is huge, even if it worked for the first batch I would've
> tried researching more. The problem is that the first batch is more than
> 500k records.
>
> Thanks & Regards
> Biplob Biswas
>
>
> On Thu, Jul 26, 2018 at 2:33 PM Todd Nist  wrote:
>
>> Hi Biplob,
>>
>> How many partitions are on the topic you are reading from and have you
>> set the maxRatePerPartition?  iirc, spark back pressure is calculated as
>> follows:
>>
>> *Spark back pressure:*
>>
>> Back pressure is calculated off of the following:
>>
>>
>> • maxRatePerPartition = 200
>> • batchInterval = 30s
>> • 3 partitions on the ingest topic
>>
>> This results in a maximum ingest rate of 18K:
>>
>> • 3 * 30 * 200 = 18,000 max
>>
>> The spark.streaming.backpressure.initialRate only applies to the first
>> batch, per docs:
>>
>>
>> This is the initial maximum receiving rate at which each receiver will
>>> receive data for the *first batch* when the backpressure mechanism is
>>> enabled.
>>
>>
>> If you  set the maxRatePerPartition and apply the above formula, I
>> believe you will be able to achieve the results you are looking for.
>>
>> HTH.
>>
>> -Todd
>>
>>
>> On Thu, Jul 26, 2018 at 7:21 AM Biplob Biswas 
>> wrote:
>>
>>> Did anyone face similar issue? and any viable way to solve this?
>>> Thanks & Regards
>>> Biplob Biswas
>>>
>>>
>>> On Wed, Jul 25, 2018 at 4:23 PM Biplob Biswas 
>>> wrote:
>>>
>>>> I have enabled the spark.streaming.backpressure.enabled setting and
>>>> also set spark.streaming.backpressure.initialRate  to 15000, but my
>>>> spark job is not respecting these settings when reading from Kafka after a
>>>> failure.
>>>>
>>>> In my kafka topic around 500k records are waiting for being processed
>>>> and they are all taken in 1 huge batch which ultimately takes a long time
>>>> and fails with executor failure exception. We don't have more resources to
>>>> give in our test cluster and we expect the backpressure to kick in and take
>>>> smaller batches.
>>>>
>>>> What can I be doing wrong?
>>>>
>>>>
>>>> Thanks & Regards
>>>> Biplob Biswas
>>>>
>>>


Re: Backpressure initial rate not working

2018-07-26 Thread Todd Nist
Hi Biplob,

How many partitions are on the topic you are reading from and have you set
the maxRatePerPartition?  iirc, spark back pressure is calculated as
follows:

*Spark back pressure:*

Back pressure is calculated off of the following:


  • maxRatePerPartition = 200
  • batchInterval = 30s
  • 3 partitions on the ingest topic

This results in a maximum ingest rate of 18K:

  • 3 * 30 * 200 = 18,000 max

The spark.streaming.backpressure.initialRate only applies to the first
batch, per docs:


> This is the initial maximum receiving rate at which each receiver will
> receive data for the *first batch* when the backpressure mechanism is
> enabled.


If you  set the maxRatePerPartition and apply the above formula, I believe
you will be able to achieve the results you are looking for.

HTH.

-Todd


On Thu, Jul 26, 2018 at 7:21 AM Biplob Biswas 
wrote:

> Did anyone face similar issue? and any viable way to solve this?
> Thanks & Regards
> Biplob Biswas
>
>
> On Wed, Jul 25, 2018 at 4:23 PM Biplob Biswas 
> wrote:
>
>> I have enabled the spark.streaming.backpressure.enabled setting and also
>>  set spark.streaming.backpressure.initialRate  to 15000, but my spark
>> job is not respecting these settings when reading from Kafka after a
>> failure.
>>
>> In my kafka topic around 500k records are waiting for being processed and
>> they are all taken in 1 huge batch which ultimately takes a long time and
>> fails with executor failure exception. We don't have more resources to give
>> in our test cluster and we expect the backpressure to kick in and take
>> smaller batches.
>>
>> What can I be doing wrong?
>>
>>
>> Thanks & Regards
>> Biplob Biswas
>>
>


Re: Tableau BI on Spark SQL

2017-01-30 Thread Todd Nist
Hi Mich,

You could look at http://www.exasol.com/.  It works very well with Tableau
without the need to extract the data.  Also in V6, it has the virtual
schemas which would allow you to access data in Spark, Hive, Oracle, or
other sources.

It may be outside of what you are looking for, but it works well for us.  We did
the extract route originally, but with the native Exasol connector it is
just as performant as the extract.

HTH.

-Todd


On Mon, Jan 30, 2017 at 10:15 PM, Jörn Franke  wrote:

> With a lot of data (TB) it is not that good, hence the extraction.
> Otherwise you have to wait every time you do drag and drop. With the
> extracts it is better.
>
> On 30 Jan 2017, at 22:59, Mich Talebzadeh 
> wrote:
>
> Thanks Jorn,
>
> So Tableau uses its own in-memory representation as I guessed. Now the
> question is how the performance is when accessing data in Oracle tables?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 30 January 2017 at 21:51, Jörn Franke  wrote:
>
>> Depending on the size of the data i recommend to schedule regularly an
>> extract in tableau. There tableau converts it to an internal in-memory
>> representation outside of Spark (can also exist on disk if memory is too
>> small) and then use it within Tableau. Accessing directly  the database is
>> not so efficient.
>> Additionally use always the newest version of tableau..
>>
>> On 30 Jan 2017, at 21:57, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> Has anyone tried using Tableau on Spark SQL?
>>
>> Specifically how does Tableau handle in-memory capabilities of Spark.
>>
>> As I understand Tableau uses its own propriety SQL against say Oracle.
>> That is well established. So for each product Tableau will try to use its
>> own version of SQL against that product  like Spark
>> or Hive.
>>
>> However, when I last tried Tableau on Hive, the mapping and performance
>> was not that good in comparison with the same tables and data in Hive.
>>
>> My approach has been to take Oracle 11.g sh schema
>> containing
>> star schema and create and ingest the same tables and data  into Hive
>> tables. Then run Tableau against these tables and do the performance
>> comparison. Given that Oracle is widely used with Tableau this test makes
>> sense?
>>
>> Thanks.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>


Re: is there any bug for the configuration of spark 2.0 cassandra spark connector 2.0 and cassandra 3.0.8

2016-09-20 Thread Todd Nist
These types of questions would be better asked on the user mailing list for
the Spark Cassandra connector:

http://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user

Version compatibility can be found here:

https://github.com/datastax/spark-cassandra-connector#version-compatibility

The JIRA, https://datastax-oss.atlassian.net/browse/SPARKC/, does not seem
to show any outstanding issues with regard to Cassandra 3.0.8 and version 2.0
of Spark or the Spark Cassandra Connector.

HTH.

-Todd


On Tue, Sep 20, 2016 at 1:47 AM, muhammet pakyürek 
wrote:

>
>
> please tell me the configuration including the most recent version of
> cassandra, spark and cassandra spark connector
>


Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Todd Nist
Hi Mich,

Have you looked at Apache Ignite?  https://apacheignite-fs.readme.io/docs.

This looks like something that may be what you're looking for:

http://apacheignite.gridgain.org/docs/data-analysis-with-apache-zeppelin

HTH.

-Todd


On Sat, Sep 17, 2016 at 12:53 PM, Mich Talebzadeh  wrote:

> Hi,
>
> I am seeing similar issues when I was working on Oracle with Tableau as
> the dashboard.
>
> Currently I have a batch layer that gets streaming data from
>
> source -> Kafka -> Flume -> HDFS
>
> It stored on HDFS as text files and a cron process sinks Hive table with
> the the external table build on the directory. I tried both ORC and Parquet
> but I don't think the query itself is the issue.
>
> Meaning it does not matter how clever your execution engine is, the fact
> you still have to do  considerable amount of Physical IO (PIO) as opposed
> to Logical IO (LIO) to get the data to Zeppelin is on the critical path.
>
> One option is to limit the amount of data in Zeppelin to certain number of
> rows or something similar. However, you cannot tell a user he/she cannot
> see the full data.
>
> We resolved this with Oracle by using Oracle TimesTen
> IMDB
> to cache certain tables in memory and get them refreshed (depending on
> refresh frequency) from the underlying table in Oracle when data is
> updated). That is done through cache fusion.
>
> I was looking around and came across Alluxio .
> Ideally I like to utilise such concept like TimesTen. Can one distribute
> Hive table data (or any table data) across the nodes cached. In that case
> we will be doing Logical IO which is about 20 times or more lightweight
> compared to Physical IO.
>
> Anyway this is the concept.
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Creating HiveContext withing Spark streaming

2016-09-08 Thread Todd Nist
Hi Mich,

Perhaps the issue is having multiple SparkContexts in the same JVM (
https://issues.apache.org/jira/browse/SPARK-2243).
While it is possible, I don't think it is encouraged.

As you know, the call you're currently invoking to create the
StreamingContext also creates a SparkContext.

/**
 * Create a StreamingContext by providing the configuration necessary
 * for a new SparkContext.
 * @param conf a org.apache.spark.SparkConf object specifying Spark parameters
 * @param batchDuration the time interval at which streaming data will be
 *                      divided into batches
 */
def this(conf: SparkConf, batchDuration: Duration) = {
  this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}


Could you rearrange the code slightly to either create the SparkContext
first and pass that to the creation of the StreamingContext, like below:

val sc = new SparkContext(sparkConf)
val streamingContext = new StreamingContext(sc, Seconds(batchInterval))

*val HiveContext = new HiveContext(sc)*

Or remove / replace the marked line below (shown between asterisks) from your code
and just set val sparkContext = streamingContext.sparkContext.

val streamingContext = new StreamingContext(sparkConf, Seconds(batchInterval))
*val sparkContext  = new SparkContext(sparkConf)*
val HiveContext = new HiveContext(streamingContext.sparkContext)
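
Putting it together, a minimal sketch with the 1.6-era APIs; it assumes your
existing sparkConf and batchInterval values:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.sql.hive.HiveContext

val sc               = new SparkContext(sparkConf)
val streamingContext = new StreamingContext(sc, Seconds(batchInterval))
val hiveContext      = new HiveContext(sc)  // reuses the one SparkContext, no second context created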

HTH.

-Todd


On Thu, Sep 8, 2016 at 9:11 AM, Mich Talebzadeh 
wrote:

> Ok I managed to sort that one out.
>
> This is what I am facing
>
>  val sparkConf = new SparkConf().
>  setAppName(sparkAppName).
>  set("spark.driver.allowMultipleContexts", "true").
>  set("spark.hadoop.validateOutputSpecs", "false")
>  // change the values accordingly.
>  sparkConf.set("sparkDefaultParllelism",
> sparkDefaultParallelismValue)
>  sparkConf.set("sparkSerializer", sparkSerializerValue)
>  sparkConf.set("sparkNetworkTimeOut",
> sparkNetworkTimeOutValue)
>  // If you want to see more details of batches please increase
> the value
>  // and that will be shown UI.
>  sparkConf.set("sparkStreamingUiRetainedBatches",
>sparkStreamingUiRetainedBatchesValue)
>  sparkConf.set("sparkWorkerUiRetainedDrivers",
>sparkWorkerUiRetainedDriversValue)
>  sparkConf.set("sparkWorkerUiRetainedExecutors",
>sparkWorkerUiRetainedExecutorsValue)
>  sparkConf.set("sparkWorkerUiRetainedStages",
>sparkWorkerUiRetainedStagesValue)
>  sparkConf.set("sparkUiRetainedJobs",
> sparkUiRetainedJobsValue)
>  sparkConf.set("enableHiveSupport",enableHiveSupportValue)
>  sparkConf.set("spark.streaming.stopGracefullyOnShutdown","
> true")
>  sparkConf.set("spark.streaming.receiver.writeAheadLog.enable",
> "true")
>  
> sparkConf.set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite",
> "true")
>  
> sparkConf.set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite",
> "true")
>  var sqltext = ""
>  val batchInterval = 2
>  val streamingContext = new StreamingContext(sparkConf,
> Seconds(batchInterval))
>
> With the above settings,  Spark streaming works fine. *However, after
> adding the first line below (in red)*
>
> *val sparkContext  = new SparkContext(sparkConf)*
> val HiveContext = new HiveContext(streamingContext.sparkContext)
>
> I get the following errors:
>
> 16/09/08 14:02:32 ERROR JobScheduler: Error running job streaming job
> 1473339752000 ms.0
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
> in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage
> 0.0 (TID 7, 50.140.197.217): java.io.IOException:
> *org.apache.spark.SparkException: Failed to get broadcast_0_piece0 of
> broadcast_0*at org.apache.spark.util.Utils$.
> tryOrIOException(Utils.scala:1260)
> at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(
> TorrentBroadcast.scala:174)
> at org.apache.spark.broadcast.TorrentBroadcast._value$
> lzycompute(TorrentBroadcast.scala:65)
> at org.apache.spark.broadcast.TorrentBroadcast._value(
> TorrentBroadcast.scala:65)
> at org.apache.spark.broadcast.TorrentBroadcast.getValue(
> TorrentBroadcast.scala:89)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.
> scala:67)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at org.apache.spark.executor.Executor$TaskRunner.run(
> Executor.scala:274)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: 

Re: Design patterns involving Spark

2016-08-30 Thread Todd Nist
Have not tried this, but looks quite useful if one is using Druid:

https://github.com/implydata/pivot  - An interactive data exploration UI
for Druid

On Tue, Aug 30, 2016 at 4:10 AM, Alonso Isidoro Roman 
wrote:

> Thanks Mitch, i will check it.
>
> Cheers
>
>
> Alonso Isidoro Roman
> [image: https://]about.me/alonso.isidoro.roman
>
> 
>
> 2016-08-30 9:52 GMT+02:00 Mich Talebzadeh :
>
>> You can use Hbase for building real time dashboards
>>
>> Check this link
>> 
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 30 August 2016 at 08:33, Alonso Isidoro Roman 
>> wrote:
>>
>>> HBase for real time queries? HBase was designed with the batch in mind.
>>> Impala should be a best choice, but i do not know what Druid can do
>>>
>>>
>>> Cheers
>>>
>>> Alonso Isidoro Roman
>>> [image: https://]about.me/alonso.isidoro.roman
>>>
>>> 
>>>
>>> 2016-08-30 8:56 GMT+02:00 Mich Talebzadeh :
>>>
 Hi Chanh,

 Druid sounds like a good choice.

 But again the point being is that what else Druid brings on top of
 Hbase.

 Unless one decides to use Druid for both historical data and real time
 data in place of Hbase!

 It is easier to write API against Druid that Hbase? You still want a UI
 dashboard?

 Cheers

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 30 August 2016 at 03:19, Chanh Le  wrote:

> Hi everyone,
>
> Seems a lot people using Druid for realtime Dashboard.
> I’m just wondering of using Druid for main storage engine because
> Druid can store the raw data and can integrate with Spark also
> (theoretical).
> In that case do we need to store 2 separate storage Druid (store
> segment in HDFS) and HDFS?.
> BTW did anyone try this one https://github.com/Sparkli
> neData/spark-druid-olap?
>
>
> Regards,
> Chanh
>
>
> On Aug 30, 2016, at 3:23 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
> Thanks Bhaarat and everyone.
>
> This is an updated version of the same diagram
>
> 
>
> The frequency of Recent data is defined by the Windows length in Spark
> Streaming. It can vary between 0.5 seconds to an hour. ( Don't think we 
> can
> move any Spark granularity below 0.5 seconds in anger. For some
> applications like Credit card transactions and fraud detection. Data is
> stored real time by Spark in Hbase tables. Hbase tables will be on HDFS as
> well. The same Spark Streaming will write asynchronously to HDFS Hive
> tables.
> One school of thought is never write to Hive from Spark, write
>  straight to Hbase and then read Hbase tables into Hive periodically?
>
> Now the third component in this layer is Serving Layer that can
> combine data from the current (Hbase) and the historical (Hive tables) to
> give the user visual analytics. Now that visual analytics can be Real time
> dashboard on top of Serving Layer. That Serving layer could be an 
> in-memory
> NoSQL offering or Data from Hbase (Red Box) combined with Hive tables.
>
> I am not aware of any industrial strength Real time Dashboard.  The
> idea is that one uses such dashboard in real time. Dashboard in this sense
> meaning a general purpose API to data store of some type like on 

Re: Writing to Hbase table from Spark

2016-08-30 Thread Todd Nist
Have you looked at spark-packages.org?  There are several different HBase
connectors there, not sure if any meet your need or not.

https://spark-packages.org/?q=hbase
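
If none of those fit, the lower-level route ayan mentions below also works: build Put
objects yourself and write through the new Hadoop output format API. A rough sketch;
the table name, column family and HBase client version are assumptions on my part,
and hbase-site.xml is assumed to be on the classpath:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "sales4")  // hypothetical table name
val job = Job.getInstance(hbaseConf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

// assume an RDD[(String, String)] of (rowKey, value) pairs
rdd.map { case (rowKey, value) =>
  val put = new Put(Bytes.toBytes(rowKey))
  // addColumn needs an HBase 1.x client; use put.add(...) on older clients
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
  (new ImmutableBytesWritable, put)
}.saveAsNewAPIHadoopDataset(job.getConfiguration)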

HTH,

-Todd

On Tue, Aug 30, 2016 at 5:23 AM, ayan guha  wrote:

> You can use rdd level new hadoop format api and pass on appropriate
> classes.
> On 30 Aug 2016 19:13, "Mich Talebzadeh"  wrote:
>
>> Hi,
>>
>> Is there an existing interface to read from and write to Hbase table in
>> Spark.
>>
>> Similar to below for Parquet
>>
>> val s = spark.read.parquet("oraclehadoop.sales2")
>> s.write.mode("overwrite").parquet("oraclehadoop.sales4")
>>
>> Or need too write Hive table which is already defined over Hbase?
>>
>>
>> Thanks
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>


Re: HiveThriftServer2.startWithContext no more showing tables in 1.6.2

2016-07-21 Thread Todd Nist
This is due to a change in 1.6: by default the Thrift server now runs in
multi-session mode. You would want to set the following to true in your
Spark config, e.g. in spark-defaults.conf:

spark.sql.hive.thriftServer.singleSession  true
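
Since you are starting the Thrift server with startWithContext, a sketch of setting it
programmatically instead, reusing the conf from your code below (I have only set it via
spark-defaults.conf myself, so treat this as untested):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val conf = new SparkConf()
  .setMaster("spark://10.0.2.15:7077")
  .set("spark.sql.hive.thriftServer.singleSession", "true")  // share temp tables with beeline sessions
  .setAppName("spark-sql-dataexample")

val hiveSqlContext = new HiveContext(SparkContext.getOrCreate(conf))
// ... register your temp tables as before ...
HiveThriftServer2.startWithContext(hiveSqlContext)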

Good write up here:
https://community.hortonworks.com/questions/29090/i-cant-find-my-tables-in-spark-sql-using-beeline.html

HTH.

-Todd

On Thu, Jul 21, 2016 at 10:30 AM, Marco Colombo  wrote:

> Thanks.
>
> That is just a typo. I'm using on 'spark://10.0.2.15:7077' (standalone).
> Same url used in --master in spark-submit
>
>
>
> 2016-07-21 16:08 GMT+02:00 Mich Talebzadeh :
>
>> Hi Marco
>>
>> In your code
>>
>> val conf = new SparkConf()
>>   .setMaster("spark://10.0.2.15:7077")
>>   .setMaster("local")
>>   .set("spark.cassandra.connection.host", "10.0.2.15")
>>   .setAppName("spark-sql-dataexample");
>>
>> As I understand the first .setMaster("spark://:7077 indicates
>> that you are using Spark in standalone mode and then .setMaster("local")
>> means you are using it in Local mode?
>>
>> Any reason for it?
>>
>> Basically you are overriding standalone with local.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 21 July 2016 at 14:55, Marco Colombo 
>> wrote:
>>
>>> Hi all, I have a spark application that was working in 1.5.2, but now
>>> has a problem in 1.6.2.
>>>
>>> Here is an example:
>>>
>>> val conf = new SparkConf()
>>>   .setMaster("spark://10.0.2.15:7077")
>>>   .setMaster("local")
>>>   .set("spark.cassandra.connection.host", "10.0.2.15")
>>>   .setAppName("spark-sql-dataexample");
>>>
>>> val hiveSqlContext = new HiveContext(SparkContext.getOrCreate(conf));
>>>
>>> //Registering tables
>>> var query = """OBJ_TAB""".stripMargin;
>>>
>>> val options = Map(
>>>   "driver" -> "org.postgresql.Driver",
>>>   "url" -> "jdbc:postgresql://127.0.0.1:5432/DB",
>>>   "user" -> "postgres",
>>>   "password" -> "postgres",
>>>   "dbtable" -> query);
>>>
>>> import hiveSqlContext.implicits._;
>>> val df: DataFrame =
>>> hiveSqlContext.read.format("jdbc").options(options).load();
>>> df.registerTempTable("V_OBJECTS");
>>>
>>>  val optionsC = Map("table"->"data_tab", "keyspace"->"data");
>>> val stats : DataFrame =
>>> hiveSqlContext.read.format("org.apache.spark.sql.cassandra").options(optionsC).load();
>>> //stats.foreach { x => println(x) }
>>> stats.registerTempTable("V_DATA");
>>>
>>> //START HIVE SERVER
>>> HiveThriftServer2.startWithContext(hiveSqlContext);
>>>
>>> Now, from app I can perform queries and joins over the 2 registered
>>> table, but if I connect to port 1 via beeline, I see no registered
>>> tables.
>>> show tables is empty.
>>>
>>> I'm using embedded DERBY DB, but this was working in 1.5.2.
>>>
>>> Any suggestion?
>>>
>>> Thanks
>>>
>>>
>>
>
>
> --
> Ing. Marco Colombo
>


Re: Load selected rows with sqlContext in the dataframe

2016-07-21 Thread Todd Nist
You can set the dbtable to this:

.option("dbtable", "(select * from master_schema where 'TID' = '100_0')")

HTH,

Todd


On Thu, Jul 21, 2016 at 10:59 AM, sujeet jog  wrote:

> I have a table of size 5GB, and want to load selective rows into dataframe
> instead of loading the entire table in memory,
>
>
> For me memory is a constraint hence , and i would like to peridically load
> few set of rows and perform dataframe operations on it,
>
> ,
> for the "dbtable"  is there a way to perform select * from master_schema
> where 'TID' = '100_0';
> which can load only this to memory as dataframe .
>
>
>
> Currently  I'm using code as below
> val df  =  sqlContext.read .format("jdbc")
>   .option("url", url)
>   .option("dbtable", "master_schema").load()
>
>
> Thansk,
> Sujeet
>


Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Todd Nist
Hi Dominik,

Right, and Spark 1.6.x uses Kafka v0.8.2.x as I recall.  However, it
appears as though the v0.8 consumer is compatible with the Kafka v0.9.x
broker, but not the other way around; sorry for the confusion there.

With the direct stream (simple consumer), offsets are tracked by Spark
Streaming within its checkpoints by default.  You can also manage them
yourself if desired.  How are you dealing with offsets?

Can you verify the offsets on the broker:

kafka-run-class.sh kafka.tools.GetOffsetShell --topic <your-topic> \
  --broker-list <broker:9092> --time -1
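
If you do end up managing offsets yourself, a rough sketch of handing explicit starting
offsets to the 1.6 direct stream; the broker, topic, partition and offset values are
placeholders, and ssc is your StreamingContext:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker:9092")   // placeholder broker
val fromOffsets = Map(TopicAndPartition("your-topic", 0) -> 0L)  // placeholder topic/partition/offset

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))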

-Todd

On Tue, Jun 7, 2016 at 8:17 AM, Dominik Safaric <dominiksafa...@gmail.com>
wrote:

> libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.0"
> libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.6.0"
> libraryDependencies += "org.apache.spark" % "spark-streaming-kafka_2.10" % 
> "1.6.1"
>
> Please take a look at the SBT copy.
>
> I would rather think that the problem is related to the Zookeeper/Kafka
> consumers.
>
> [2016-06-07 11:24:52,484] WARN Either no config or no quorum defined in
> config, running  in standalone mode
> (org.apache.zookeeper.server.quorum.QuorumPeerMain)
>
> Any indication onto why the channel connection might be closed? Would it
> be Kafka or Zookeeper related?
>
> On 07 Jun 2016, at 14:07, Todd Nist <tsind...@gmail.com> wrote:
>
> What version of Spark are you using?  I do not believe that 1.6.x is
> compatible with 0.9.0.1 due to changes in the kafka clients between 0.8.2.2
> and 0.9.0.x.  See this for more information:
>
> https://issues.apache.org/jira/browse/SPARK-12177
>
> -Todd
>
> On Tue, Jun 7, 2016 at 7:35 AM, Dominik Safaric <dominiksafa...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Correct, I am using the 0.9.0.1 version.
>>
>> As already described, the topic contains messages. Those messages are
>> produced using the Confluence REST API.
>>
>> However, what I’ve observed is that the problem is not in the Spark
>> configuration, but rather Zookeeper or Kafka related.
>>
>> Take a look at the exception’s stack top item:
>>
>> org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
>> org.apache.spark.SparkException: Couldn't find leader offsets for
>> Set([,0])
>> at
>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>> at
>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>> at scala.util.Either.fold(Either.scala:97)
>> at
>> org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
>> at
>> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
>> at
>> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
>> at org.mediasoft.spark.Driver$.main(Driver.scala:22)
>> at .(:11)
>> at .()
>> at .(:7)
>>
>> By listing all active connections using netstat, I’ve also observed that
>> both Zookeper and Kafka are running. Zookeeper on port 2181, while Kafka
>> 9092.
>>
>> Furthermore, I am also able to retrieve all log messages using the
>> console consumer.
>>
>> Any clue what might be going wrong?
>>
>> On 07 Jun 2016, at 13:13, Jacek Laskowski <ja...@japila.pl> wrote:
>>
>> Hi,
>>
>> What's the version of Spark? You're using Kafka 0.9.0.1, ain't you?
>> What's the topic name?
>>
>> Jacek
>> On 7 Jun 2016 11:06 a.m., "Dominik Safaric" <dominiksafa...@gmail.com>
>> wrote:
>>
>>> As I am trying to integrate Kafka into Spark, the following exception
>>> occurs:
>>>
>>> org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
>>> org.apache.spark.SparkException: Couldn't find leader offsets for
>>> Set([**,0])
>>> at
>>>
>>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>>> at
>>>
>>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>>> at scala.util.Either.fold(Either.scala:97)
>>> at
>>>
>>> org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
>>> at
>>>
>>> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
>>> at
>>>
>>> org.apache.spark.streaming.kafka.KafkaUti

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Todd Nist
What version of Spark are you using?  I do not believe that 1.6.x is
compatible with 0.9.0.1 due to changes in the kafka clients between 0.8.2.2
and 0.9.0.x.  See this for more information:

https://issues.apache.org/jira/browse/SPARK-12177

-Todd

On Tue, Jun 7, 2016 at 7:35 AM, Dominik Safaric 
wrote:

> Hi,
>
> Correct, I am using the 0.9.0.1 version.
>
> As already described, the topic contains messages. Those messages are
> produced using the Confluence REST API.
>
> However, what I’ve observed is that the problem is not in the Spark
> configuration, but rather Zookeeper or Kafka related.
>
> Take a look at the exception’s stack top item:
>
> org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
> org.apache.spark.SparkException: Couldn't find leader offsets for
> Set([,0])
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
> at scala.util.Either.fold(Either.scala:97)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
> at
> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
> at
> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
> at org.mediasoft.spark.Driver$.main(Driver.scala:22)
> at .(:11)
> at .()
> at .(:7)
>
> By listing all active connections using netstat, I’ve also observed that
> both Zookeper and Kafka are running. Zookeeper on port 2181, while Kafka
> 9092.
>
> Furthermore, I am also able to retrieve all log messages using the console
> consumer.
>
> Any clue what might be going wrong?
>
> On 07 Jun 2016, at 13:13, Jacek Laskowski  wrote:
>
> Hi,
>
> What's the version of Spark? You're using Kafka 0.9.0.1, ain't you? What's
> the topic name?
>
> Jacek
> On 7 Jun 2016 11:06 a.m., "Dominik Safaric" 
> wrote:
>
>> As I am trying to integrate Kafka into Spark, the following exception
>> occurs:
>>
>> org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
>> org.apache.spark.SparkException: Couldn't find leader offsets for
>> Set([**,0])
>> at
>>
>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>> at
>>
>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>> at scala.util.Either.fold(Either.scala:97)
>> at
>>
>> org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
>> at
>>
>> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
>> at
>>
>> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
>> at org.mediasoft.spark.Driver$.main(Driver.scala:42)
>> at .(:11)
>> at .()
>> at .(:7)
>> at .()
>> at $print()
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>>
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:483)
>> at
>> scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:734)
>> at
>> scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:983)
>> at
>> scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
>> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:604)
>> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:568)
>> at
>> scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:760)
>> at
>> scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:805)
>> at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:717)
>> at
>> scala.tools.nsc.interpreter.ILoop.processLine$1(ILoop.scala:581)
>> at scala.tools.nsc.interpreter.ILoop.innerLoop$1(ILoop.scala:588)
>> at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:591)
>> at
>>
>> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:882)
>> at
>>
>> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837)
>> at
>>
>> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837)
>> at
>>
>> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>> at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:837)
>> at scala.tools.nsc.interpreter.ILoop.main(ILoop.scala:904)
>> at
>>
>> org.jetbrains.plugins.scala.compiler.rt.ConsoleRunner.main(ConsoleRunner.java:64)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>>
>> 

Re: Unit testing framework for Spark Jobs?

2016-05-18 Thread Todd Nist
Perhaps these may be of some use:

https://github.com/mkuthan/example-spark
http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/
https://github.com/holdenk/spark-testing-base
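
As a concrete starting point, a minimal, hand-rolled sketch of the shared-context style
those links describe, using plain ScalaTest; the names and the local[2] master are just
illustrative, and spark-testing-base above gives you a more complete version of this:

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

trait WithSparkContext extends BeforeAndAfterAll { self: FunSuite =>
  @transient var sc: SparkContext = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    // one local context shared by all tests in the suite, to avoid repeated startup cost
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unit-test"))
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()
    super.afterAll()
  }
}

class WordCountSpec extends FunSuite with WithSparkContext {
  test("counts words") {
    val counts = sc.parallelize(Seq("a b", "b")).flatMap(_.split(" ")).countByValue()
    assert(counts("b") == 2)
  }
}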

On Wed, May 18, 2016 at 2:14 PM, swetha kasireddy  wrote:

> Hi Lars,
>
> Do you have any examples for the methods that you described for Spark
> batch and Streaming?
>
> Thanks!
>
> On Wed, Mar 30, 2016 at 2:41 AM, Lars Albertsson 
> wrote:
>
>> Thanks!
>>
>> It is on my backlog to write a couple of blog posts on the topic, and
>> eventually some example code, but I am currently busy with clients.
>>
>> Thanks for the pointer to Eventually - I was unaware. Fast exit on
>> exception would be a useful addition, indeed.
>>
>> Lars Albertsson
>> Data engineering consultant
>> www.mapflat.com
>> +46 70 7687109
>>
>> On Mon, Mar 28, 2016 at 2:00 PM, Steve Loughran 
>> wrote:
>> > this is a good summary -Have you thought of publishing it at the end of
>> a URL for others to refer to
>> >
>> >> On 18 Mar 2016, at 07:05, Lars Albertsson  wrote:
>> >>
>> >> I would recommend against writing unit tests for Spark programs, and
>> >> instead focus on integration tests of jobs or pipelines of several
>> >> jobs. You can still use a unit test framework to execute them. Perhaps
>> >> this is what you meant.
>> >>
>> >> You can use any of the popular unit test frameworks to drive your
>> >> tests, e.g. JUnit, Scalatest, Specs2. I prefer Scalatest, since it
>> >> gives you choice of TDD vs BDD, and it is also well integrated with
>> >> IntelliJ.
>> >>
>> >> I would also recommend against using testing frameworks tied to a
>> >> processing technology, such as Spark Testing Base. Although it does
>> >> seem well crafted, and makes it easy to get started with testing,
>> >> there are drawbacks:
>> >>
>> >> 1. I/O routines are not tested. Bundled test frameworks typically do
>> >> not materialise datasets on storage, but pass them directly in memory.
>> >> (I have not verified this for Spark Testing Base, but it looks so.)
>> >> I/O routines are therefore not exercised, and they often hide bugs,
>> >> e.g. related to serialisation.
>> >>
>> >> 2. You create a strong coupling between processing technology and your
>> >> tests. If you decide to change processing technology (which can happen
>> >> soon in this fast paced world...), you need to rewrite your tests.
>> >> Therefore, during a migration process, the tests cannot detect bugs
>> >> introduced in migration, and help you migrate fast.
>> >>
>> >> I recommend that you instead materialise input datasets on local disk,
>> >> run your Spark job, which writes output datasets to local disk, read
>> >> output from disk, and verify the results. You can still use Spark
>> >> routines to read and write input and output datasets. A Spark context
>> >> is expensive to create, so for speed, I would recommend reusing the
>> >> Spark context between input generation, running the job, and reading
>> >> output.
>> >>
>> >> This is easy to set up, so you don't need a dedicated framework for
>> >> it. Just put your common boilerplate in a shared test trait or base
>> >> class.
>> >>
>> >> In the future, when you want to replace your Spark job with something
>> >> shinier, you can still use the old tests, and only replace the part
>> >> that runs your job, giving you some protection from regression bugs.
>> >>
>> >>
>> >> Testing Spark Streaming applications is a different beast, and you can
>> >> probably not reuse much from your batch testing.
>> >>
>> >> For testing streaming applications, I recommend that you run your
>> >> application inside a unit test framework, e.g, Scalatest, and have the
>> >> test setup create a fixture that includes your input and output
>> >> components. For example, if your streaming application consumes from
>> >> Kafka and updates tables in Cassandra, spin up single node instances
>> >> of Kafka and Cassandra on your local machine, and connect your
>> >> application to them. Then feed input to a Kafka topic, and wait for
>> >> the result to appear in Cassandra.
>> >>
>> >> With this setup, your application still runs in Scalatest, the tests
>> >> run without custom setup in maven/sbt/gradle, and you can easily run
>> >> and debug inside IntelliJ.
>> >>
>> >> Docker is suitable for spinning up external components. If you use
>> >> Kafka, the Docker image spotify/kafka is useful, since it bundles
>> >> Zookeeper.
>> >>
>> >> When waiting for output to appear, don't sleep for a long time and
>> >> then check, since it will slow down your tests. Instead enter a loop
>> >> where you poll for the results and sleep for a few milliseconds in
>> >> between, with a long timeout (~30s) before the test fails with a
>> >> timeout.
>> >
>> > org.scalatest.concurrent.Eventually is your friend there
>> >
>> > eventually(stdTimeout, stdInterval) {
>> > listRestAPIApplications(connector, webUI, true) should

Re: Spark SQL Transaction

2016-04-23 Thread Todd Nist
I believe the class you are looking for is
org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala.

By default in savePartition(...), it will do the following:

if (supportsTransactions) {
  conn.setAutoCommit(false) // Everything in the same db transaction.
}

Then at line 224, it will issue the commit:

if (supportsTransactions) {
  conn.commit()
}

HTH

-Todd

On Sat, Apr 23, 2016 at 8:57 AM, Andrés Ivaldi  wrote:

> Hello, so I executed Profiler and found that implicit isolation was turn
> on by JDBC driver, this is the default behavior of MSSQL JDBC driver, but
> it's possible change it with setAutoCommit method. There is no property for
> that so I've to do it in the code, do you now where can I access to the
> instance of JDBC class used by Spark on DataFrames?
>
> Regards.
>
> On Thu, Apr 21, 2016 at 10:59 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> This statement
>>
>> ."..each database statement is atomic and is itself a transaction.. your
>> statements should be atomic and there will be no ‘redo’ or ‘commit’ or
>> ‘rollback’."
>>
>> MSSQL compiles with ACIDITY which requires that each transaction be "all
>> or nothing": if one part of the transaction fails, then the entire
>> transaction fails, and the database state is left unchanged.
>>
>> Assuming that it is one transaction (through much doubt if JDBC does that
>> as it will take for ever), then either that transaction commits (in MSSQL
>> redo + undo are combined in syslogs table of the database) meaning
>> there will be undo + redo log generated  for that row only in syslogs. So
>> under normal operation every RDBMS including MSSQL, Oracle, Sybase and
>> others will comply with generating (redo and undo) and one cannot avoid it.
>> If there is a batch transaction as I suspect in this case, it is either all
>> or nothing. The thread owner indicated that rollback is happening so it is
>> consistent with all rows rolled back.
>>
>> I don't think Spark, Sqoop, Hive can influence the transaction behaviour
>> of an RDBMS for DML. DQ (data queries) do not generate transactions.
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 21 April 2016 at 13:58, Michael Segel 
>> wrote:
>>
>>> Hi,
>>>
>>> Sometimes terms get muddled over time.
>>>
>>> If you’re not using transactions, then each database statement is atomic
>>> and is itself a transaction.
>>> So unless you have some explicit ‘Begin Work’ at the start…. your
>>> statements should be atomic and there will be no ‘redo’ or ‘commit’ or
>>> ‘rollback’.
>>>
>>> I don’t see anything in Spark’s documentation about transactions, so the
>>> statements should be atomic.  (I’m not a guru here so I could be missing
>>> something in Spark)
>>>
>>> If you’re seeing the connection drop unexpectedly and then a rollback,
>>> could this be a setting or configuration of the database?
>>>
>>>
>>> > On Apr 19, 2016, at 1:18 PM, Andrés Ivaldi  wrote:
>>> >
>>> > Hello, is possible to execute a SQL write without Transaction? we dont
>>> need transactions to save our data and this adds an overhead to the
>>> SQLServer.
>>> >
>>> > Regards.
>>> >
>>> > --
>>> > Ing. Ivaldi Andres
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>
>
> --
> Ing. Ivaldi Andres
>


Re: Spark 1.6.1. How to prevent serialization of KafkaProducer

2016-04-21 Thread Todd Nist
Have you looked at these:

http://allegro.tech/2015/08/spark-kafka-integration.html
http://mkuthan.github.io/blog/2016/01/29/spark-kafka-integration2/

Full example here:

https://github.com/mkuthan/example-spark-kafka

HTH.

-Todd

On Thu, Apr 21, 2016 at 2:08 PM, Alexander Gallego 
wrote:

> Thanks Ted.
>
>  KafkaWordCount (producer) does not operate on a DStream[T]
>
> ```scala
>
>
> object KafkaWordCountProducer {
>
>   def main(args: Array[String]) {
> if (args.length < 4) {
>   System.err.println("Usage: KafkaWordCountProducer
>   " +
> " ")
>   System.exit(1)
> }
>
> val Array(brokers, topic, messagesPerSec, wordsPerMessage) = args
>
> // Zookeeper connection properties
> val props = new HashMap[String, Object]()
> props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
> props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
>   "org.apache.kafka.common.serialization.StringSerializer")
> props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
>   "org.apache.kafka.common.serialization.StringSerializer")
>
> val producer = new KafkaProducer[String, String](props)
>
> // Send some messages
> while(true) {
>   (1 to messagesPerSec.toInt).foreach { messageNum =>
> val str = (1 to wordsPerMessage.toInt).map(x =>
> scala.util.Random.nextInt(10).toString)
>   .mkString(" ")
>
> val message = new ProducerRecord[String, String](topic, null, str)
> producer.send(message)
>   }
>
>   Thread.sleep(1000)
> }
>   }
>
> }
>
> ```
>
>
> Also, doing:
>
>
> ```
> object KafkaSink {
>  def send(brokers: String, sc: SparkContext, topic: String, key:
> String, value: String) =
> getInstance(brokers, sc).value.send(new ProducerRecord(topic,
> key, value))
> }
>
> KafkaSink.send(brokers, sparkContext)(outputTopic, record._1, record._2)
>
> ```
>
>
> Doesn't work either, the result is:
>
> Exception in thread "main" org.apache.spark.SparkException: Task not
> serializable
>
>
> Thanks!
>
>
>
>
> On Thu, Apr 21, 2016 at 1:08 PM, Ted Yu  wrote:
> >
> > In KafkaWordCount , the String is sent back and producer.send() is
> called.
> >
> > I guess if you don't find a viable solution in your current design, you can
> consider the above.
> >
> > On Thu, Apr 21, 2016 at 10:04 AM, Alexander Gallego 
> wrote:
> >>
> >> Hello,
> >>
> >> I understand that you cannot serialize Kafka Producer.
> >>
> >> So I've tried:
> >>
> >> (as suggested here
> https://forums.databricks.com/questions/369/how-do-i-handle-a-task-not-serializable-exception.html)
>
> >>
> >>  - Make the class Serializable - not possible
> >>
> >>  - Declare the instance only within the lambda function passed in map.
> >>
> >> via:
> >>
> >> // as suggested by the docs
> >>
> >>
> >> ```scala
> >>
> >>kafkaOut.foreachRDD(rdd => {
> >>  rdd.foreachPartition(partition => {
> >>   val producer = new KafkaProducer(..)
> >>   partition.foreach { record =>
> >>   producer.send(new ProducerRecord(outputTopic, record._1,
> record._2))
> >>}
> >>   producer.close()
> >>})
> >>  }) // foreachRDD
> >>
> >>
> >> ```
> >>
> >> - Make the NotSerializable object as a static and create it once per
> machine.
> >>
> >> via:
> >>
> >>
> >> ```scala
> >>
> >>
> >> object KafkaSink {
> >>   @volatile private var instance: Broadcast[KafkaProducer[String,
> String]] = null
> >>   def getInstance(brokers: String, sc: SparkContext):
> Broadcast[KafkaProducer[String, String]] = {
> >> if (instance == null) {
> >>   synchronized {
> >> println("Creating new kafka producer")
> >> val props = new java.util.Properties()
> >> ...
> >> instance = sc.broadcast(new KafkaProducer[String,
> String](props))
> >> sys.addShutdownHook {
> >>   instance.value.close()
> >> }
> >>   }
> >> }
> >> instance
> >>   }
> >> }
> >>
> >>
> >> ```
> >>
> >>
> >>
> >>  - Call rdd.forEachPartition and create the NotSerializable object in
> there like this:
> >>
> >> Same as above.
> >>
> >>
> >> - Mark the instance @transient
> >>
> >> Same thing, just make it a class variable via:
> >>
> >>
> >> ```
> >> @transient var producer: KafkaProducer[String,String] = null
> >> def getInstance() = {
> >>if( producer == null ) {
> >>producer = new KafkaProducer()
> >>}
> >>producer
> >> }
> >>
> >> ```
> >>
> >>
> >> However, I get serialization problems with all of these options.
> >>
> >>
> >> Thanks for your help.
> >>
> >> - Alex
> >>
> >
>
>
>
> --
>
>
>
>
>
> Alexander Gallego
> Co-Founder & CTO
>


Re: How to change akka.remote.startup-timeout in spark

2016-04-21 Thread Todd Nist
I believe you can adjust it by setting the following:

spark.akka.timeout (default: 100s) - Communication timeout between Spark nodes.
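
For example (a rough sketch; the timeout value is a placeholder, and whether
this also covers akka.remote.startup-timeout may depend on the Spark/Akka
version in use):

// programmatically, before creating the SparkContext
val conf = new org.apache.spark.SparkConf()
  .setAppName("MyApp")
  .set("spark.akka.timeout", "300s")

// or at submit time:
//   spark-submit --conf spark.akka.timeout=300s ...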

HTH.

-Todd



On Thu, Apr 21, 2016 at 9:49 AM, yuemeng (A)  wrote:

> When I run a Spark application, sometimes I get the following ERROR:
>
> 16/04/21 09:26:45 ERROR SparkContext: Error initializing SparkContext.
>
> java.util.concurrent.TimeoutException: Futures timed out after [1
> milliseconds]
>
>  at
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>
>  at
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>
>  at
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>
>  at
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>
>  at scala.concurrent.Await$.result(package.scala:107)
>
>  at akka.remote.Remoting.start(Remoting.scala:180)
>
>  at
> akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
>
>  at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:618)
>
>  at
> akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:615)
>
>  at akka.actor.ActorSystemImpl._start(ActorSystem.scala:615)
>
>  at akka.actor.ActorSystemImpl.start(ActorSystem.scala:632)
>
>  at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)
>
>  at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)
>
>  at
> org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:122)
>
>  at
> org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
>
>  at
> org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
>
>  at
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1995)
>
>  at
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>
>  at
> org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1986)
>
>  at
> org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
>
>  at
> org.apache.spark.rpc.akka.AkkaRpcEnvFactory.create(AkkaRpcEnv.scala:245)
>
>
>
>
>
> And I tracked the code; I think updating akka.remote.startup-timeout may
> solve this problem, but I can't find any way to change this.
>
> Has anybody met this problem, and do you know how to change the akka config
> in Spark?
>
> Thanks a lot
>
>
>
> *Yue Meng (Rick) 00277916*
>
> *Big Data Technology Development Department*
>
> *China Software Big Data 3ms team:* http://3ms.huawei.com/hi/group/2031037
>
>
>
>
>


Re: Apache Flink

2016-04-17 Thread Todd Nist
So there is an offering from Stratio, https://github.com/Stratio/Decision

Decision CEP engine is a Complex Event Processing platform built on Spark
> Streaming.
>


> It is the result of combining the power of Spark Streaming as a continuous
> computing framework and Siddhi CEP engine as complex event processing
> engine.


https://stratio.atlassian.net/wiki/display/DECISION0x9/Home

I have not used it, only read about it but it may be of some interest to
you.

-Todd

On Sun, Apr 17, 2016 at 5:49 PM, Peyman Mohajerian 
wrote:

> Microbatching is certainly not a waste of time; you are making far too
> strong a statement. In fact, in certain cases one tuple at a time makes no
> sense; it all depends on the use case. If you understand the history of the
> Storm project, you would know that microbatching was added to Storm later,
> via Trident, specifically for microbatching/windowing.
> In certain cases you are doing aggregation/windowing, throughput is the
> dominant design consideration, and you don't care what each individual
> event/tuple does. For example, if you push different event types to separate
> Kafka topics and all you care about is a count, what is the need for
> single-event processing?
>
> On Sun, Apr 17, 2016 at 12:43 PM, Corey Nolet  wrote:
>
>> i have not been intrigued at all by the microbatching concept in Spark. I
>> am used to CEP in real streams processing environments like Infosphere
>> Streams & Storm where the granularity of processing is at the level of each
>> individual tuple and processing units (workers) can react immediately to
>> events being received and processed. The closest Spark streaming comes to
>> this concept is the notion of "state" that can be updated via the
>> "updateStateByKey()" functions, which are only able to be run in a
>> microbatch. Looking at the expected design changes to Spark Streaming in
>> Spark 2.0.0, it also does not look like tuple-at-a-time processing is on
>> the radar for Spark, though I have seen articles stating that more effort
>> is going to go into the Spark SQL layer in Spark streaming which may make
>> it more reminiscent of Esper.
>>
>> For these reasons, I have not even tried to implement CEP in Spark. I
>> feel it's a waste of time without immediate tuple-at-a-time processing.
>> Without this, they avoid the whole problem of "back pressure" (though keep
>> in mind, it is still very possible to overload the Spark streaming layer
>> with stages that will continue to pile up and never get worked off) but
>> they lose the granular control that you get in CEP environments by allowing
>> the rules & processors to react with the receipt of each tuple, right away.
>>
>> Awhile back, I did attempt to implement an InfoSphere Streams-like API
>> [1] on top of Apache Storm as an example of what such a design may look
>> like. It looks like Storm is going to be replaced in the not so distant
>> future by Twitter's new design called Heron. IIRC, Heron does not have an
>> open source implementation as of yet.
>>
>> [1] https://github.com/calrissian/flowmix
>>
>> On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Corey,
>>>
>>> Can you please point me to docs on using Spark for CEP? Do we have a set
>>> of CEP libraries somewhere. I am keen on getting hold of adaptor libraries
>>> for Spark something like below
>>>
>>>
>>>
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 17 April 2016 at 16:07, Corey Nolet  wrote:
>>>
 One thing I've noticed about Flink in my following of the project has
 been that it has established, in a few cases, some novel ideas and
 improvements over Spark. The problem with it, however, is that both the
 development team and the community around it are very small and many of
 those novel improvements have been rolled directly into Spark in subsequent
 versions. I was considering changing over my architecture to Flink at one
 point to get better, more real-time CEP streaming support, but in the end I
 decided to stick with Spark and just watch Flink continue to pressure it
 into improvement.

 On Sun, Apr 17, 2016 at 11:03 AM, Koert Kuipers 
 wrote:

> i never found much info that flink was actually designed to be fault
> tolerant. if fault tolerance is more bolt-on/add-on/afterthought then that
> doesn't bode well for large scale data processing. spark was designed with
> fault tolerance in mind from the beginning.
>
> On Sun, Apr 17, 2016 at 9:52 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> I 

Re: "bootstrapping" DStream state

2016-03-10 Thread Todd Nist
The updateStateByKey can be supplied an initialRDD to populate it with.
Per code (
https://github.com/apache/spark/blob/v1.4.0/streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala#L435-L445
).

Provided here for your convenience.


  /**
   * Return a new "state" DStream where the state for each key is updated by applying
   * the given function on the previous state of the key and the new values of the key.
   * org.apache.spark.Partitioner is used to control the partitioning of each RDD.
   * @param updateFunc State update function. If `this` function returns None, then
   *                   corresponding state key-value pair will be eliminated.
   * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
   *                    DStream.
   * @param initialRDD initial state value of each key.
   * @tparam S State type
   */
  def updateStateByKey[S: ClassTag](
      updateFunc: (Seq[V], Option[S]) => Option[S],
      partitioner: Partitioner,
      initialRDD: RDD[(K, S)]
    ): DStream[(K, S)] = ssc.withScope {
    val cleanedUpdateF = sparkContext.clean(updateFunc)
    val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
      iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
    }
    updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)
  }

Simple example by Aniket Bhatnagar from an earlier thread on the forum.

def counter(events: Seq[Event], prevStateOpt: Option[Long]): Option[Long] = {
  val prevCount = prevStateOpt.getOrElse(0L)
  val newCount = prevCount + events.size
  Some(newCount)
}

val interval = 60 * 1000
val initialRDD = sparkContext.makeRDD(Array(1L, 2L, 3L, 4L, 5L))
  .map(_ * interval)
  .map(n => (n % interval, n / interval))
val counts = eventsStream
  .map(event => (event.timestamp - event.timestamp % interval, event))
  .updateStateByKey[Long](PrintEventCountsByInterval.counter _,
    new HashPartitioner(3), initialRDD = initialRDD)
counts.print()
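
For the Hive-backed dictionary in the original question below, the same
mechanism should let you load the historical key/values once at startup and
pass them in as the initial state, instead of joining on every batch. A rough
sketch (table and column names are assumptions; input is the
DStream[(String, String)] of updates):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// hiveContext is an org.apache.spark.sql.hive.HiveContext
val initialState: RDD[(String, String)] = hiveContext
  .sql("SELECT key, value FROM dictionary_history")
  .map(row => (row.getString(0), row.getString(1)))

val dictionary = input.updateStateByKey[String](
  // keep the latest update for a key, otherwise carry the previous value forward
  (updates: Seq[String], prev: Option[String]) => updates.lastOption.orElse(prev),
  new HashPartitioner(input.context.sparkContext.defaultParallelism),
  initialState)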


HTH.

-Todd


On Thu, Mar 10, 2016 at 1:35 AM, Zalzberg, Idan (Agoda) <
idan.zalzb...@agoda.com> wrote:

> Hi,
>
>
>
> I have a spark-streaming application that basically keeps track of a
> string->string dictionary.
>
>
>
> So I have messages coming in with updates, like:
>
> “A”->”B”
>
> And I need to update the dictionary.
>
>
>
> This seems like a simple use case for the updateStateByKey method.
>
>
>
> However, my issue is that when the app starts I need to “initialize” the
> dictionary with data from a hive table, that has all the historical
> key/values with the dictionary.
>
>
>
> The only way I could think of is doing something like:
>
>
>
> val rdd =… //get data from hive
>
> *def *process(input: DStream[(String, String)]) = {
> input.join(rdd).updateStateByKey(*update*)
>   }
>
> So the join operation will be done on every incoming buffer, where in fact
> I only need it on initialization.
>
>
>
> Any idea how to achieve that?
>
>
>
> Thanks
>
> --
> This message is confidential and is for the sole use of the intended
> recipient(s). It may also be privileged or otherwise protected by copyright
> or other legal rules. If you have received it by mistake please let us know
> by reply email and delete it from your system. It is prohibited to copy
> this message or disclose its content to anyone. Any confidentiality or
> privilege is not waived or lost by any mistaken delivery or unauthorized
> disclosure of the message. All messages sent to and from Agoda may be
> monitored to ensure compliance with company policies, to protect the
> company's interests and to remove potential malware. Electronic messages
> may be intercepted, amended, lost or deleted, or contain viruses.
>


Re: Spark Streaming, very slow processing and increasing scheduling delay of kafka input stream

2016-03-10 Thread Todd Nist
Hi Vinti,

All of your tasks are failing based on the screen shots provided.

I think a few more details would be helpful.  Is this YARN or a Standalone
cluster?  How much overall memory is on your cluster?  On each machine
where workers and executors are running?  Are you using the Direct
(KafkaUtils.createDirectStream) or Receiver (KafkaUtils.createStream)?
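
For reference, the two ingestion styles mentioned above look roughly like this
(Spark 1.x Kafka API; broker, ZooKeeper and topic values are placeholders, and
ssc is your StreamingContext):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Direct approach: no receiver, Spark tracks the offsets itself
val direct = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("mytopic"))

// Receiver-based approach: consumes via ZooKeeper, one receiver per call
val receiver = KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", Map("mytopic" -> 1))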

You may find this discussion of value on SO:
http://stackoverflow.com/questions/28901123/org-apache-spark-shuffle-metadatafetchfailedexception-missing-an-output-locatio

-Todd

On Mon, Mar 7, 2016 at 5:52 PM, Vinti Maheshwari 
wrote:

> Hi,
>
> My spark-streaming program seems very slow. I am using Ambari for cluster
> setup and I am using Kafka for data input.
> I tried a batch size of 2 secs and a checkpointing duration of 10 secs. But
> as I saw the scheduling delay kept increasing, I tried increasing the batch
> size to 5 and then 10 secs. But it seems nothing changed with respect to
> performance.
>
> *My program is doing two tasks:*
>
> 1) Data aggregation
>
> 2) Data insertion into Hbase
>
> The action which took the maximum time was when I called foreachRDD on the
> DStream object (state).
>
> *state.foreachRDD(rdd => rdd.foreach(Blaher.blah))*
>
>
>
>
> *Program sample input coming from kafka:*
> test_id, file1, 1,1,1,1,1
>
> *Code snippets:*
>
> val parsedStream = inputStream
>   .map(line => {
> val splitLines = line.split(",")
> (splitLines(1), splitLines.slice(2, 
> splitLines.length).map((_.trim.toLong)))
>   })
> val state: DStream[(String, Array[Long])] = parsedStream.updateStateByKey(
> (current: Seq[Array[Long]], prev: Option[Array[Long]]) =>  {
>   prev.map(_ +: current).orElse(Some(current))
> .flatMap(as => Try(as.map(BDV(_)).reduce(_ + _).toArray).toOption)
> })
> *state.foreachRDD(rdd => rdd.foreach(Blaher.blah))*
>
>
>
> object Blaher {
>   def blah(tup: (String, Array[Long])) {
> val hConf = HBaseConfiguration.create()
> --
> val hTable = new HTable(hConf, tableName)
> val thePut = new Put(Bytes.toBytes("file_data"))
> thePut.add(Bytes.toBytes("file_counts"), Bytes.toBytes(tup._1), 
> Bytes.toBytes(tup._2.toList.toString))
> new ImmutableBytesWritable(Bytes.toBytes("file_data"))
>
> hTable.put(thePut)
>   }
> }
>
>
> *My Cluster Specifications:*
> 16 executors ( 1 core each and 2g memory)
>
> I have attached some screenshots of running execution.
>
> Does anyone have an idea of what changes I should make to speed up the
> processing?
>
> Thanks & Regards,
>
> Vinti
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>


Re: Building a REST Service with Spark back-end

2016-03-02 Thread Todd Nist
Have you looked at Apache Toree, http://toree.apache.org/.  This was
formerly the Spark-Kernel from IBM but contributed to apache.

https://github.com/apache/incubator-toree

You can find a good overview on the spark-kernel here:
http://www.spark.tc/how-to-enable-interactive-applications-against-apache-spark/

Not sure if that is of value to you or not.

HTH.

-Todd

On Tue, Mar 1, 2016 at 7:30 PM, Don Drake  wrote:

> I'm interested in building a REST service that utilizes a Spark SQL
> Context to return records from a DataFrame (or IndexedRDD?) and even
> add/update records.
>
> This will be a simple REST API, with only a few end-points.  I found this
> example:
>
> https://github.com/alexmasselot/spark-play-activator
>
> which looks close to what I am interested in doing.
>
> Are there any other ideas or options if I want to run this in a YARN
> cluster?
>
> Thanks.
>
> -Don
>
> --
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/
> https://twitter.com/dondrake 
> 800-733-2143
>


Re: Spark for client

2016-03-01 Thread Todd Nist
You could also look at Apache Toree, http://toree.apache.org/
, github : https://github.com/apache/incubator-toree.  This use to be the
Spark Kernel from IBM but has been contributed to Apache.

Good overview here on its features,
http://www.spark.tc/how-to-enable-interactive-applications-against-apache-spark/.
Specifically this section on usage:

Usage

Using the kernel as the backbone of communication, we have enabled several
higher-level applications to interact with Apache Spark:

*->* Livesheets, a line of business tool for data exploration

*->* A RESTful query engine running on top of Spark SQL

*->* A demonstration of a PHP application utilizing Apache Spark at ZendCon 2014

*->* IPython notebook running the Spark Kernel underneath

HTH.
Todd

On Tue, Mar 1, 2016 at 4:10 AM, Mich Talebzadeh 
wrote:

> Thanks Mohannad.
>
> Installed Anaconda 3 that contains Jupyter. Now I want to access Spark on
> Scala from Jupyter. What is the easiest way of doing it without using
> Python!
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 1 March 2016 at 08:18, Mohannad Ali  wrote:
>
>> Jupyter (http://jupyter.org/) also supports Spark and generally it's a
>> beast allows you to do so much more.
>> On Mar 1, 2016 00:25, "Mich Talebzadeh" 
>> wrote:
>>
>>> Thank you very much both
>>>
>>> Zeppelin looks promising. Basically as I understand runs an agent on a
>>> given port (I chose 21999) on the host that Spark is installed. I created a
>>> notebook and am running scripts through there. One thing for sure: the notebook
>>> just returns the results rather than all the other stuff that one does not need.
>>>
>>> Cheers,
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 29 February 2016 at 19:22, Minudika Malshan 
>>> wrote:
>>>
 +Adding resources
 https://zeppelin.incubator.apache.org/docs/latest/interpreter/spark.html
 https://zeppelin.incubator.apache.org

 Minudika Malshan
 Undergraduate
 Department of Computer Science and Engineering
 University of Moratuwa.
 *Mobile : +94715659887 <%2B94715659887>*
 *LinkedIn* : https://lk.linkedin.com/in/minudika



 On Tue, Mar 1, 2016 at 12:51 AM, Minudika Malshan <
 minudika...@gmail.com> wrote:

> Hi,
>
> I think zeppelin spark interpreter will give a solution to your
> problem.
>
> Regards.
> Minudika
>
> Minudika Malshan
> Undergraduate
> Department of Computer Science and Engineering
> University of Moratuwa.
> *Mobile : +94715659887 <%2B94715659887>*
> *LinkedIn* : https://lk.linkedin.com/in/minudika
>
>
>
> On Tue, Mar 1, 2016 at 12:35 AM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
>> Zeppelin?
>>
>> Regards
>> Sab
>> On 01-Mar-2016 12:27 am, "Mich Talebzadeh" 
>> wrote:
>>
>>> Hi,
>>>
>>> Is there such a thing as Spark for clients, much like RDBMS clients that
>>> have a cut-down version of their big brother, useful for client
>>> connectivity but that cannot be used as a server?
>>>
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>
>

>>>
>


Re: Spark Integration Patterns

2016-02-28 Thread Todd Nist
I'm not sure on Python, not expert in that area.  Based on pr,
https://github.com/apache/spark/pull/8318, I believe you are correct that
Spark would need to be installed for you to be able to currently leverage
the pyspark package.

On Sun, Feb 28, 2016 at 1:38 PM, moshir mikael <moshir.mik...@gmail.com>
wrote:

> Ok,
> but what do I need for the program to run.
> In python  sparkcontext  = SparkContext(conf) only works when you have
> spark installed locally.
> AFAIK there is no *pyspark *package for python that you can install doing
> pip install pyspark.
> You actually need to install spark to get it running (e.g :
> https://github.com/KristianHolsheimer/pyspark-setup-guide).
>
> Does it mean you need to install spark on the box your applications runs
> to benefit from pyspark and this is required to connect to another remote
> spark cluster ?
> Am I missing something obvious ?
>
>
> Le dim. 28 févr. 2016 à 19:01, Todd Nist <tsind...@gmail.com> a écrit :
>
>> Define your SparkConfig to set the master:
>>
>>   val conf = new SparkConf().setAppName(AppName)
>> .setMaster(SparkMaster)
>> .set()
>>
>> Where SparkMaster = "spark://SparkServerHost:7077".  So if your spark
>> server hostname is "RADTech" then it would be "spark://RADTech:7077".
>>
>> Then when you create the SparkContext, pass the SparkConf  to it:
>>
>> val sparkContext = new SparkContext(conf)
>>
>> Then use the sparkContext to interact with the SparkMaster / Cluster.
>> Your program basically becomes the driver.
>>
>> HTH.
>>
>> -Todd
>>
>> On Sun, Feb 28, 2016 at 9:25 AM, mms <moshir.mik...@gmail.com> wrote:
>>
>>> Hi, I cannot find a simple example showing how a typical application can
>>> 'connect' to a remote spark cluster and interact with it. Let's say I have
>>> a Python web application hosted somewhere *outside *a spark cluster,
>>> with just python installed on it. How can I talk to Spark without using a
>>> notebook, or using ssh to connect to a cluster master node ? I know of
>>> spark-submit and spark-shell, however forking a process on a remote host to
>>> execute a shell script seems like a lot of effort. What are the recommended
>>> ways to connect and query Spark from a remote client ? Thanks Thx !
>>> --
>>> View this message in context: Spark Integration Patterns
>>> <http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Integration-Patterns-tp26354.html>
>>> Sent from the Apache Spark User List mailing list archive
>>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>>>
>>
>>


Re: Spark Integration Patterns

2016-02-28 Thread Todd Nist
Define your SparkConfig to set the master:

  val conf = new SparkConf().setAppName(AppName)
.setMaster(SparkMaster)
.set()

Where SparkMaster = "spark://SparkServerHost:7077".  So if your spark
server hostname is "RADTech" then it would be "spark://RADTech:7077".

Then when you create the SparkContext, pass the SparkConf  to it:

val sparkContext = new SparkContext(conf)

Then use the sparkContext to interact with the SparkMaster / Cluster.
Your program basically becomes the driver.

HTH.

-Todd

On Sun, Feb 28, 2016 at 9:25 AM, mms  wrote:

> Hi, I cannot find a simple example showing how a typical application can
> 'connect' to a remote spark cluster and interact with it. Let's say I have
> a Python web application hosted somewhere *outside *a spark cluster, with
> just python installed on it. How can I talk to Spark without using a
> notebook, or using ssh to connect to a cluster master node ? I know of
> spark-submit and spark-shell, however forking a process on a remote host to
> execute a shell script seems like a lot of effort. What are the recommended
> ways to connect and query Spark from a remote client ? Thanks Thx !
> --
> View this message in context: Spark Integration Patterns
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


Re: Saving Kafka Offsets to Cassandra at begining of each batch in Spark Streaming

2016-02-16 Thread Todd Nist
You could use the "withSessionDo" of the SparkCassandraConnector to perform
the simple insert:

CassandraConnector(conf).withSessionDo { session => session.execute() }
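
Fleshed out a bit on the driver side, it could look something like the sketch
below (Scala; the keyspace/table and stream names are assumptions). Since the
offsets are already available on the driver, no extra RDD is needed:

import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.streaming.kafka.HasOffsetRanges

val connector = CassandraConnector(sc.getConf)  // sc is the SparkContext

directKafkaStream.foreachRDD { rdd =>
  // offset ranges for this batch, available on the driver
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  connector.withSessionDo { session =>
    offsetRanges.foreach { o =>
      session.execute(
        "INSERT INTO ks.kafka_offsets (topic, topic_partition, from_offset, until_offset) VALUES (?, ?, ?, ?)",
        o.topic, Int.box(o.partition), Long.box(o.fromOffset), Long.box(o.untilOffset))
    }
  }
  // ... then process the rdd ...
}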

-Todd

On Tue, Feb 16, 2016 at 11:01 AM, Cody Koeninger  wrote:

> You could use sc.parallelize... but the offsets are already available at
> the driver, and they're a (hopefully) small enough amount of data that's
> it's probably more straightforward to just use the normal cassandra client
> to save them from the driver.
>
> On Tue, Feb 16, 2016 at 1:15 AM, Abhishek Anand 
> wrote:
>
>> I have a kafka rdd and I need to save the offsets to cassandra table at
>> the begining of each batch.
>>
>> Basically I need to write the offsets of the type Offsets below that I am
>> getting inside foreachRD, to cassandra. The javafunctions api to write to
>> cassandra needs a rdd. How can I create a rdd from offsets and write to
>> cassandra table.
>>
>>
>> public static void writeOffsets(JavaPairDStream<String, String> kafkastream){
>> kafkastream.foreachRDD((rdd,batchMilliSec) -> {
>> OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
>> return null;
>> });
>>
>>
>> Thanks !!
>> Abhi
>>
>>
>>
>


Re: Passing binding variable in query used in Data Source API

2016-01-21 Thread Todd Nist
Hi Satish,

You should be able to do something like this:

   val props = new java.util.Properties()
   props.put("user", username)
   props.put("password", pwd)
   props.put("driver", "org.postgresql.Driver")
   val deptNo = 10
   val where = Some(s"dept_number = $deptNo")
   val df = sqlContext.read.jdbc(
     "jdbc:postgresql://10.00.00.000:5432/db_test?user=username&password=password",
     "schema.table1", Array(where.getOrElse("")), props)

or just add the filter to your query like this, and I believe it should
get pushed down.

  val df = sqlContext.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://10.00.00.000:5432/db_test?user=username&password=password")
    .option("user", username)
    .option("password", pwd)
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "schema.table1")
    .load()
    .filter(s"dept_number = $deptNo")

This is from the top of my head and the code has not been tested or
compiled.

HTH.

-Todd


On Thu, Jan 21, 2016 at 6:02 AM, satish chandra j 
wrote:

> Hi All,
>
> We have requirement to fetch data from source PostgreSQL database as per a
> condition, hence need to pass a binding variable in query used in Data
> Source API as below:
>
>
> var DeptNbr = 10
>
> val dataSource_dF=cc.load("jdbc",Map("url"->"jdbc:postgresql://
> 10.00.00.000:5432/db_test?user=username=password","driver"->"org.postgresql.Driver","dbtable"->"(select*
> from schema.table1 where dept_number=DeptNbr) as table1"))
>
>
> But it errors saying expected ';' but found '='
>
>
> Note: As it is an iterative approach hence cannot use constants but need
> to pass variable to query
>
>
> If anybody had a similar implementation to pass binding variable while
> fetching data from source database using Data Source than please provide
> details on the same
>
>
> Regards,
>
> Satish Chandra
>


Re: NPE when using Joda DateTime

2016-01-14 Thread Todd Nist
I had a similar problem a while back and leveraged these Kryo serializers,
https://github.com/magro/kryo-serializers.  I had to fallback to version
0.28, but that was a while back.  You can add these to the

org.apache.spark.serializer.KryoRegistrator

and then set your registrator in the spark config:

sparkConfig.
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrator", "com.yourpackage.YourKryoRegistrator")
...

where YourKryoRegistrator is something like:

class YourKryoRegistrator extends KryoRegistrator {

  override def registerClasses(kryo: Kryo) {
kryo.register(classOf[org.joda.time.DateTime], new
JodaDateTimeSerializer)
kryo.register(classOf[org.joda.time.Interval], new
JodaIntervalSerializer)
  }
}

HTH.

-Todd

On Thu, Jan 14, 2016 at 9:28 AM, Spencer, Alex (Santander) <
alex.spen...@santander.co.uk.invalid> wrote:

> Hi,
>
> I tried take(1500) and test.collect and these both work on the "single"
> map statement.
>
> I'm very new to Kryo serialisation, I managed to find some code and I
> copied and pasted and that's what originally made the single map statement
> work:
>
> class MyRegistrator extends KryoRegistrator {
>   override def registerClasses(kryo: Kryo) {
> kryo.register(classOf[org.joda.time.DateTime])
>   }
> }
>
> Is it because the groupBy sees a different class type? Maybe
> Array[DateTime]? I don’t want to find the answer by trial and error though.
>
> Alex
>
> -Original Message-
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: 14 January 2016 14:07
> To: Spencer, Alex (Santander)
> Cc: user@spark.apache.org
> Subject: Re: NPE when using Joda DateTime
>
> It does look somehow like the state of the DateTime object isn't being
> recreated properly on deserialization somehow, given where the NPE occurs
> (look at the Joda source code). However the object is java.io.Serializable.
> Are you sure the Kryo serialization is correct?
>
> It doesn't quite explain why the map operation works by itself. It could
> be the difference between executing locally (take(1) will look at 1
> partition in 1 task which prefers to be local) and executing remotely
> (groupBy is going to need a shuffle).
>
> On Thu, Jan 14, 2016 at 1:01 PM, Spencer, Alex (Santander)
>  wrote:
> > Hello,
> >
> >
> >
> > I was wondering if somebody is able to help me get to the bottom of a
> > null pointer exception I’m seeing in my code. I’ve managed to narrow
> > down a problem in a larger class to my use of Joda’s DateTime
> > functions. I’ve successfully run my code in scala, but I’ve hit a few
> > problems when adapting it to run in spark.
> >
> >
> >
> > Spark version: 1.3.0
> >
> > Scala version: 2.10.4
> >
> > Java HotSpot 1.7
> >
> >
> >
> > I have a small case class called Transaction, which looks something
> > like
> > this:
> >
> >
> >
> > case class Transaction(date : org.joda.time.DateTime = new
> > org.joda.time.DateTime())
> >
> >
> >
> > I have an RDD[Transactions] trans:
> >
> > org.apache.spark.rdd.RDD[Transaction] = MapPartitionsRDD[4] at map at
> > :44
> >
> >
> >
> > I am able to run this successfully:
> >
> >
> >
> > val test = trans.map(_.date.minusYears(10))
> >
> > test.take(1)
> >
> >
> >
> > However if I do:
> >
> >
> >
> > val groupedTrans = trans.groupBy(_.account)
> >
> >
> >
> > //For each group, process transactions in turn:
> >
> > val test = groupedTrans.flatMap { case (_, transList) =>
> >
> >   transList.map {transaction =>
> >
> > transaction.date.minusYears(10)
> >
> >   }
> >
> > }
> >
> > test.take(1)
> >
> >
> >
> > I get:
> >
> >
> >
> > java.lang.NullPointerException
> >
> > at org.joda.time.DateTime.minusYears(DateTime.java:1268)
> >
> >
> >
> > Should the second operation not be equivalent to the first .map one?
> > (It’s a long way round of producing my error – but it’s extremely
> > similar to what’s happening in my class).
> >
> >
> >
> > I’ve got a custom registration class for Kryo which I think is working
> > - before I added this the original .map did not work – but shouldn’t
> > it be able to serialize all instances of Joda DateTime?
> >
> >
> >
> > Thank you for any help / pointers you can give me.
> >
> >
> >
> > Kind Regards,
> >
> > Alex.
> >
> >
> >
> > Alex Spencer
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
> commands, e-mail: user-h...@spark.apache.org
>
> Emails aren't always secure, and they may be intercepted or changed after
> they've been sent. Santander doesn't accept liability if this happens. If
> you
> think someone may have interfered with this email, please get in touch
> with the
> sender another way. This message doesn't create or change any contract.
> Santander doesn't accept responsibility for damage caused by any viruses
> contained in this email or its attachments. Emails may be monitored. If
> you've
> 

Re: GroupBy on DataFrame taking too much time

2016-01-11 Thread Todd Nist
Hi Rajeshwar Gaini,

dbtable can be any valid SQL query; simply define it as a subquery,
something like:


  val query = "(SELECT country, count(*) FROM customer group by country) as
X"

  val df1 = sqlContext.read
.format("jdbc")
.option("url", url)
.option("user", username)
.option("password", pwd)
.option("driver", "driverClassNameHere")
.option("dbtable", query)
.load()

Not sure if that's what you're looking for or not.

HTH.

-Todd

On Mon, Jan 11, 2016 at 3:47 AM, Gaini Rajeshwar <
raja.rajeshwar2...@gmail.com> wrote:

> There is no problem with the sql read. When i do the following it is
> working fine.
>
> *val dataframe1 = sqlContext.load("jdbc", Map("url" ->
> "jdbc:postgresql://localhost/customerlogs?user=postgres=postgres",
> "dbtable" -> "customer"))*
>
> *dataframe1.filter("country = 'BA'").show()*
>
> On Mon, Jan 11, 2016 at 1:41 PM, Xingchi Wang  wrote:
>
>> Error happend at the "Lost task 0.0 in stage 0.0", I think it is not the
>> "groupBy" problem, it's the sql read the "customer" table issue,
>> please check the jdbc link and the data is loaded successfully??
>>
>> Thanks
>> Xingchi
>>
>> 2016-01-11 15:43 GMT+08:00 Gaini Rajeshwar 
>> :
>>
>>> Hi All,
>>>
>>> I have a table named *customer* (customer_id, event, country,  ) in a
>>> PostgreSQL database. This table has more than 100 million rows.
>>>
>>> I want to know the number of events from each country. To achieve that I am
>>> doing a groupBy using Spark as follows.
>>>
>>> *val dataframe1 = sqlContext.load("jdbc", Map("url" ->
>>> "jdbc:postgresql://localhost/customerlogs?user=postgres=postgres",
>>> "dbtable" -> "customer"))*
>>>
>>>
>>> *dataframe1.groupBy("country").count().show()*
>>>
>>> The above code seems to be fetching the complete customer table before doing
>>> the groupBy. For that reason it is throwing the following error:
>>>
>>> *16/01/11 12:49:04 WARN HeartbeatReceiver: Removing executor 0 with no
>>> recent heartbeats: 170758 ms exceeds timeout 12 ms*
>>> *16/01/11 12:49:04 ERROR TaskSchedulerImpl: Lost executor 0 on
>>> 10.2.12.59 : Executor heartbeat timed out after 170758
>>> ms*
>>> *16/01/11 12:49:04 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID
>>> 0, 10.2.12.59): ExecutorLostFailure (executor 0 exited caused by one of the
>>> running tasks) Reason: Executor heartbeat timed out after 170758 ms*
>>>
>>> I am using spark 1.6.0
>>>
>>> Is there anyway i can solve this ?
>>>
>>> Thanks,
>>> Rajeshwar Gaini.
>>>
>>
>>
>


Re: write new data to mysql

2016-01-08 Thread Todd Nist
Sorry, did not see your update until now.

On Fri, Jan 8, 2016 at 3:52 PM, Todd Nist <tsind...@gmail.com> wrote:

> Hi Yasemin,
>
> What version of Spark are you using?  Here is the reference, it is off of
> the DataFrame
> https://spark.apache.org/docs/latest/api/java/index.html#org.apache.spark.sql.DataFrame
>  and provides a DataFrameWriter,
> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameWriter.html
> :
>
> DataFrameWriter
> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameWriter.html>
>  *write
> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html#write()>*
> ()
> Interface for saving the content of the DataFrame
> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html>
>  out
> into external storage.
>
> It is the very last method defined there in the api docs.
>
> HTH.
>
> -Todd
>
>
> On Fri, Jan 8, 2016 at 2:27 PM, Yasemin Kaya <godo...@gmail.com> wrote:
>
>> Hi,
>> There is no write function that Todd mentioned or i cant find it.
>> The code and error are in gist
>> <https://gist.github.com/yaseminn/f5a2b78b126df71dfd0b>. Could you check
>> it out please?
>>
>> Best,
>> yasemin
>>
>> 2016-01-08 18:23 GMT+02:00 Todd Nist <tsind...@gmail.com>:
>>
>>> It is not clear from the information provided why the insertIntoJDBC
>>> failed in #2.  I would note that method on the DataFrame has been deprecated
>>> since 1.4, not sure what version you're on.  You should be able to do
>>> something like this:
>>>
>>>  DataFrame.write.mode(SaveMode.Append).jdbc(MYSQL_CONNECTION_URL_WRITE,
>>> "track_on_alarm", connectionProps)
>>>
>>> HTH.
>>>
>>> -Todd
>>>
>>> On Fri, Jan 8, 2016 at 10:53 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Which Spark release are you using ?
>>>>
>>>> For case #2, was there any error / clue in the logs ?
>>>>
>>>> Cheers
>>>>
>>>> On Fri, Jan 8, 2016 at 7:36 AM, Yasemin Kaya <godo...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I want to write dataframe existing mysql table, but when i use
>>>>> *peopleDataFrame.insertIntoJDBC(MYSQL_CONNECTION_URL_WRITE,
>>>>> "track_on_alarm",false)*
>>>>>
>>>>> it says "Table track_on_alarm already exists."
>>>>>
>>>>> And when i *use peopleDataFrame.insertIntoJDBC(MYSQL_CONNECTION_URL_WRITE,
>>>>> "track_on_alarm",true)*
>>>>>
>>>>> i lost the existing data.
>>>>>
>>>>> How i can write new data to db?
>>>>>
>>>>> Best,
>>>>> yasemin
>>>>>
>>>>> --
>>>>> hiç ender hiç
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> hiç ender hiç
>>
>
>


Re: write new data to mysql

2016-01-08 Thread Todd Nist
Hi Yasemin,

What version of Spark are you using?  Here is the reference, it is off of
the DataFrame
https://spark.apache.org/docs/latest/api/java/index.html#org.apache.spark.sql.DataFrame
 and provides a DataFrameWriter,
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameWriter.html
:

DataFrameWriter
<https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameWriter.html>
*write()*
<https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html#write()>
Interface for saving the content of the DataFrame
<https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html>
out into external storage.

It is the very last method defined there in the api docs.

HTH.

-Todd


On Fri, Jan 8, 2016 at 2:27 PM, Yasemin Kaya <godo...@gmail.com> wrote:

> Hi,
> There is no write function that Todd mentioned or i cant find it.
> The code and error are in gist
> <https://gist.github.com/yaseminn/f5a2b78b126df71dfd0b>. Could you check
> it out please?
>
> Best,
> yasemin
>
> 2016-01-08 18:23 GMT+02:00 Todd Nist <tsind...@gmail.com>:
>
>> It is not clear from the information provided why the insertIntoJDBC
>> failed in #2.  I would note that method on the DataFrame has been deprecated
>> since 1.4, not sure what version you're on.  You should be able to do
>> something like this:
>>
>>  DataFrame.write.mode(SaveMode.Append).jdbc(MYSQL_CONNECTION_URL_WRITE,
>> "track_on_alarm", connectionProps)
>>
>> HTH.
>>
>> -Todd
>>
>> On Fri, Jan 8, 2016 at 10:53 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Which Spark release are you using ?
>>>
>>> For case #2, was there any error / clue in the logs ?
>>>
>>> Cheers
>>>
>>> On Fri, Jan 8, 2016 at 7:36 AM, Yasemin Kaya <godo...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I want to write dataframe existing mysql table, but when i use
>>>> *peopleDataFrame.insertIntoJDBC(MYSQL_CONNECTION_URL_WRITE,
>>>> "track_on_alarm",false)*
>>>>
>>>> it says "Table track_on_alarm already exists."
>>>>
>>>> And when i *use peopleDataFrame.insertIntoJDBC(MYSQL_CONNECTION_URL_WRITE,
>>>> "track_on_alarm",true)*
>>>>
>>>> i lost the existing data.
>>>>
>>>> How i can write new data to db?
>>>>
>>>> Best,
>>>> yasemin
>>>>
>>>> --
>>>> hiç ender hiç
>>>>
>>>
>>>
>>
>
>
> --
> hiç ender hiç
>


Re: write new data to mysql

2016-01-08 Thread Todd Nist
It is not clear from the information provided why the insertIntoJDBC failed
in #2.  I would note that method on the DataFrame has been deprecated since
1.4, not sure what version you're on.  You should be able to do something
like this:

 DataFrame.write.mode(SaveMode.Append).jdbc(MYSQL_CONNECTION_URL_WRITE,
"track_on_alarm", connectionProps)

HTH.

-Todd

On Fri, Jan 8, 2016 at 10:53 AM, Ted Yu  wrote:

> Which Spark release are you using ?
>
> For case #2, was there any error / clue in the logs ?
>
> Cheers
>
> On Fri, Jan 8, 2016 at 7:36 AM, Yasemin Kaya  wrote:
>
>> Hi,
>>
>> I want to write dataframe existing mysql table, but when i use
>> *peopleDataFrame.insertIntoJDBC(MYSQL_CONNECTION_URL_WRITE,
>> "track_on_alarm",false)*
>>
>> it says "Table track_on_alarm already exists."
>>
>> And when i *use peopleDataFrame.insertIntoJDBC(MYSQL_CONNECTION_URL_WRITE,
>> "track_on_alarm",true)*
>>
>> i lost the existing data.
>>
>> How i can write new data to db?
>>
>> Best,
>> yasemin
>>
>> --
>> hiç ender hiç
>>
>
>


Re: problem building spark on centos

2016-01-06 Thread Todd Nist
That should read "I think you're missing the --name option".  Sorry about
that.

On Wed, Jan 6, 2016 at 3:03 PM, Todd Nist <tsind...@gmail.com> wrote:

> Hi Jade,
>
> I think you "--name" option. The makedistribution should look like this:
>
> ./make-distribution.sh --name hadoop-2.6 --tgz -Pyarn -Phadoop-2.6
> -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests.
>
> As for why it failed to build with scala 2.11, did you run the
> ./dev/change-scala-version.sh 2.11 script to set the version of the
> artifacts to 2.11?  If you do that then issue the build like this I think
> you will be ok:
>
> ./make-distribution.sh --name hadoop-2.6_scala-2.11 --tgz -Pyarn
> -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Dscala-2.11
> -DskipTests
>
> HTH.
>
> -Todd
>
> On Wed, Jan 6, 2016 at 2:20 PM, Jade Liu <jade@nor1.com> wrote:
>
>> I’ve changed the scala version to 2.10.
>>
>> With this command:
>> build/mvn -X -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean
>> package
>> Build was successful.
>>
>> But make a runnable version:
>> /make-distribution.sh --tgz -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0
>>  -Phive -Phive-thriftserver –DskipTests
>> Still fails with the following error:
>> [ERROR] Failed to execute goal
>> net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first)
>> on project spark-launcher_2.10: Execution scala-compile-first of goal
>> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed
>> -> [Help 1]
>> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
>> goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile
>> (scala-compile-first) on project spark-launcher_2.10: Execution
>> scala-compile-first of goal
>> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
>> at
>> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
>> at
>> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
>> at
>> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
>> at
>> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
>> at
>> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
>> at
>> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
>> at
>> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
>> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
>> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
>> at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
>> at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
>> at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
>> at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at
>> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
>> at
>> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
>> at
>> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
>> at
>> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
>> Caused by: org.apache.maven.plugin.PluginExecutionException: Execution
>> scala-compile-first of goal
>> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
>> at
>> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)
>> at
>> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
>> ... 20 more
>> Caused by: Compile failed via zinc server
>> at
>> sbt_inc.SbtIncrementalCompiler.zincCompile(SbtIncrementalCompiler.java:136)
>> at sbt_inc.SbtIncrementalCompiler.compile(SbtIncrementalCompiler.java:86)
>> at
>> scala_maven.ScalaCompilerSupport.incrementalCompile(ScalaCompilerSupport.java:303)
>> at scala_maven.ScalaCompilerSupport.compile(ScalaCompilerSupport.java:119)
>> at
>> scala_maven.ScalaCompilerSupport.doExecute(ScalaCompilerSupport.java:99)
>> at scala_maven.ScalaMojoSupport.execute(

Re: problem building spark on centos

2016-01-06 Thread Todd Nist
Hi Jade,

I think you "--name" option. The makedistribution should look like this:

./make-distribution.sh --name hadoop-2.6 --tgz -Pyarn -Phadoop-2.6
-Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests.

As for why it failed to build with scala 2.11, did you run the
./dev/change-scala-version.sh 2.11 script to set the version of the
artifacts to 2.11?  If you do that then issue the build like this I think
you will be ok:

./make-distribution.sh --name hadoop-2.6_scala-2.11 --tgz -Pyarn
-Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Dscala-2.11
-DskipTests

HTH.

-Todd

On Wed, Jan 6, 2016 at 2:20 PM, Jade Liu  wrote:

> I’ve changed the scala version to 2.10.
>
> With this command:
> build/mvn -X -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean
> package
> Build was successful.
>
> But make a runnable version:
> /make-distribution.sh --tgz -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0
>  -Phive -Phive-thriftserver –DskipTests
> Still fails with the following error:
> [ERROR] Failed to execute goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first)
> on project spark-launcher_2.10: Execution scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed
> -> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile
> (scala-compile-first) on project spark-launcher_2.10: Execution
> scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
> at
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
> at
> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
> at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
> at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
> at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
> at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> Caused by: org.apache.maven.plugin.PluginExecutionException: Execution
> scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
> at
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
> ... 20 more
> Caused by: Compile failed via zinc server
> at
> sbt_inc.SbtIncrementalCompiler.zincCompile(SbtIncrementalCompiler.java:136)
> at sbt_inc.SbtIncrementalCompiler.compile(SbtIncrementalCompiler.java:86)
> at
> scala_maven.ScalaCompilerSupport.incrementalCompile(ScalaCompilerSupport.java:303)
> at scala_maven.ScalaCompilerSupport.compile(ScalaCompilerSupport.java:119)
> at scala_maven.ScalaCompilerSupport.doExecute(ScalaCompilerSupport.java:99)
> at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:482)
> at
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
>
> Not sure what’s causing it. Does anyone have any idea?
>
> Thanks!
>
> Jade
> From: Ted Yu 
> Date: Wednesday, January 6, 2016 at 10:40 AM
> To: Jade Liu , user 
>
> Subject: Re: problem building spark on centos
>
> w.r.t. the second error, have you read this ?
>
> http://www.captaindebug.com/2013/03/mavens-non-resolvable-parent-pom-problem.html#.Vo1fFGSrSuo
>
> On Wed, Jan 6, 2016 at 9:49 AM, Jade Liu  wrote:
>
>> I’m using 3.3.9. Thanks!
>>
>> Jade
>>
>> From: Ted Yu 
>> Date: Tuesday, January 5, 2016 at 4:57 PM
>> To: Jade Liu 
>> Cc: 

Re: problem building spark on centos

2016-01-06 Thread Todd Nist
Not sure, I just built it with java 8, but 7 is supported so that should be
fine.  Are you using maven 3.3.3 + ?

RADTech:spark-1.5.2 tnist$ mvn -version
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
MaxPermSize=512m; support was removed in 8.0
Apache Maven 3.3.3 (7994120775791599e205a5524ec3e0dfe41d4a06;
2015-04-22T07:57:37-04:00)
Maven home: /usr/local/maven
Java version: 1.8.0_51, vendor: Oracle Corporation
Java home:
/Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.10.5", arch: "x86_64", family: "mac"



On Wed, Jan 6, 2016 at 3:27 PM, Jade Liu <jade@nor1.com> wrote:

> Hi, Todd:
>
> Thanks for your suggestion. Yes I did run the
> ./dev/change-scala-version.sh 2.11 script when using scala version 2.11.
>
> I just tried this as you suggested:
> ./make-distribution.sh --name hadoop-2.6 --tgz -Pyarn -Phadoop-2.6
> -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver –DskipTests
>
> Still got the same error:
> [ERROR] Failed to execute goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first)
> on project spark-launcher_2.10: Execution scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed
> -> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile
> (scala-compile-first) on project spark-launcher_2.10: Execution
> scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
> at
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
> at
> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
> at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
> at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
> at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
> at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> Caused by: org.apache.maven.plugin.PluginExecutionException: Execution
> scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
> at
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
> ... 20 more
> Caused by: Compile failed via zinc server
> at
> sbt_inc.SbtIncrementalCompiler.zincCompile(SbtIncrementalCompiler.java:136)
> at sbt_inc.SbtIncrementalCompiler.compile(SbtIncrementalCompiler.java:86)
> at
> scala_maven.ScalaCompilerSupport.incrementalCompile(ScalaCompilerSupport.java:303)
> at scala_maven.ScalaCompilerSupport.compile(ScalaCompilerSupport.java:119)
> at scala_maven.ScalaCompilerSupport.doExecute(ScalaCompilerSupport.java:99)
> at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:482)
> at
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
> ... 21 more
> [ERROR]
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the
> command
> [ERROR]   mvn  -rf :spark-launch

Re: looking for a easier way to count the number of items in a JavaDStream

2015-12-16 Thread Todd Nist
Another possible alternative is to register a StreamingListener and then
reference the BatchInfo.numRecords; good example here,
https://gist.github.com/akhld/b10dc491aad1a2007183.

After registering the listener, simply implement the appropriate "onEvent"
method, where onEvent is onBatchStarted, onBatchCompleted, ..., for example:

public void onBatchCompleted(StreamingListenerBatchCompleted batchCompleted) {
    System.out.println("Batch completed, Total records :" +
        batchCompleted.batchInfo().numRecords().get().toString());
}

That should be very efficient and avoid any collects(), just to obtain the
count of records on the DStream.

HTH.

-Todd

On Wed, Dec 16, 2015 at 3:34 PM, Bryan Cutler  wrote:

> To follow up with your other issue, if you are just trying to count
> elements in a DStream, you can do that without an Accumulator.  foreachRDD
> is meant to be an output action, it does not return anything and it is
> actually run in the driver program.  Because Java (before 8) handles
> closures a little differently, it might be easiest to implement the
> function to pass to foreachRDD as something like this:
>
> class MyFunc implements VoidFunction {
>
>   public long total = 0;
>
>   @Override
>   public void call(JavaRDD rdd) {
> System.out.println("foo " + rdd.collect().toString());
> total += rdd.count();
>   }
> }
>
> MyFunc f = new MyFunc();
>
> inputStream.foreachRDD(f);
>
> // f.total will have the count of all RDDs
>
> Hope that helps some!
>
> -bryan
>
> On Wed, Dec 16, 2015 at 8:37 AM, Bryan Cutler  wrote:
>
>> Hi Andy,
>>
>> Regarding the foreachrdd return value, this Jira that will be in 1.6
>> should take care of that https://issues.apache.org/jira/browse/SPARK-4557
>> and make things a little simpler.
>> On Dec 15, 2015 6:55 PM, "Andy Davidson" 
>> wrote:
>>
>>> I am writing  a JUnit test for some simple streaming code. I want to
>>> make assertions about how many things are in a given JavaDStream. I wonder
>>> if there is an easier way in Java to get the count?
>>>
>>> I think there are two points of friction.
>>>
>>>
>>>1. is it easy to create an accumulator of type double or int, How
>>>ever Long is not supported
>>>2. We need to use javaDStream.foreachRDD. The Function interface
>>>must return void. I was not able to define an accumulator in my driver
>>>and use a lambda function. (I am new to lambda in Java)
>>>
>>> Here is a little lambda example that logs my test objects. I was not
>>> able to figure out how to get  to return a value or access a accumulator
>>>
>>>data.foreachRDD(rdd -> {
>>>
>>> logger.info(“Begin data.foreachRDD" );
>>>
>>> for (MyPojo pojo : rdd.collect()) {
>>>
>>> logger.info("\n{}", pojo.toString());
>>>
>>> }
>>>
>>> return null;
>>>
>>> });
>>>
>>>
>>> Any suggestions would be greatly appreciated
>>>
>>> Andy
>>>
>>> This following code works in my driver but is a lot of code for such a
>>> trivial computation. Because it needs to the JavaSparkContext I do not
>>> think it would work inside a closure. I assume the works do not have access
>>> to the context as a global and that it shipping it in the closure is not a
>>> good idea?
>>>
>>> public class JavaDStreamCount implements Serializable {
>>>
>>> private static final long serialVersionUID = -3600586183332429887L;
>>>
>>> public static Logger logger =
>>> LoggerFactory.getLogger(JavaDStreamCount.class);
>>>
>>>
>>>
>>> public Double hack(JavaSparkContext sc, JavaDStream javaDStream)
>>> {
>>>
>>> Count c = new Count(sc);
>>>
>>> javaDStream.foreachRDD(c);
>>>
>>> return c.getTotal().value();
>>>
>>> }
>>>
>>>
>>>
>>> class Count implements Function {
>>>
>>> private static final long serialVersionUID =
>>> -5239727633710162488L;
>>>
>>> Accumulator total;
>>>
>>>
>>>
>>> public Count(JavaSparkContext sc) {
>>>
>>> total = sc.accumulator(0.0);
>>>
>>> }
>>>
>>>
>>>
>>> @Override
>>>
>>> public java.lang.Void call(JavaRDD rdd) throws Exception {
>>>
>>> List data = rdd.collect();
>>>
>>> int dataSize = data.size();
>>>
>>> logger.error("data.size:{}", dataSize);
>>>
>>> long num = rdd.count();
>>>
>>> logger.error("num:{}", num);
>>>
>>> total.add(new Double(num));
>>>
>>> return null;
>>>
>>> }
>>>
>>>
>>> public Accumulator getTotal() {
>>>
>>> return total;
>>>
>>> }
>>>
>>> }
>>>
>>> }
>>>
>>>
>>>
>>>
>>>
>


Re: Securing objects on the thrift server

2015-12-15 Thread Todd Nist
see https://issues.apache.org/jira/browse/SPARK-11043, it is resolved in
1.6.

On Tue, Dec 15, 2015 at 2:28 PM, Younes Naguib <
younes.nag...@tritondigital.com> wrote:

> The one coming with spark 1.5.2.
>
>
>
> y
>
>
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com]
> *Sent:* December-15-15 1:59 PM
> *To:* Younes Naguib
> *Cc:* user@spark.apache.org
> *Subject:* Re: Securing objects on the thrift server
>
>
>
> Which Hive release are you using ?
>
>
>
> Please take a look at HIVE-8529
>
>
>
> Cheers
>
>
>
> On Tue, Dec 15, 2015 at 8:25 AM, Younes Naguib <
> younes.nag...@tritondigital.com> wrote:
>
> Hi all,
>
>
> I get this error when running "show current roles;"  :
>
> 2015-12-15 15:50:41 WARN
> org.apache.hive.service.cli.thrift.ThriftCLIService ThriftCLIService:681 -
> Error fetching results:
> org.apache.hive.service.cli.HiveSQLException: Couldn't find log associated
> with operation handle: OperationHandle [opType=EXECUTE_STATEMENT,
> getHandleIdentifier()=cd309366-839f-468c-add1-81a506e92254]
> at
> org.apache.hive.service.cli.operation.OperationManager.getOperationLogRowSet(OperationManager.java:229)
> at
> org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:687)
> at
> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454)
> at
> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:672)
> at
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
> at
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
> at
> org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> at
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
> at
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> Any ideas?
>
> Thanks,
>
> *Younes * 
>
>
>


Re: [Spark Streaming] How to clear old data from Stream State?

2015-11-25 Thread Todd Nist
Perhaps the new trackStateByKey targeted for 1.6 may help you here.
I'm not sure whether it is part of 1.6, as the JIRA does not
specify a fix version.  The JIRA describing it is here:
https://issues.apache.org/jira/browse/SPARK-2629, and the design doc that
discusses the API changes is here:

https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#

Look for the timeout function:

/**
  * Set the duration of inactivity (i.e. no new data) after which a state
  * can be terminated by the system. After this idle period, the system
  * will mark the idle state as being timed out, and call the tracking
  * function with State[S].isTimingOut() = true.
  */
 def timeout(duration: Duration): this.type
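
A rough sketch of how the timeout fits in, written against the mapWithState /
StateSpec form this API took in 1.6 (the key and value types and the tracking
function below are illustrative, not from this thread):

import org.apache.spark.streaming.{Minutes, State, StateSpec}

// illustrative tracking function: sums values per key and reacts to the idle timeout
def trackingFunc(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
  if (state.isTimingOut()) {
    // the state for this key is being dropped after the idle period
    (key, state.get())
  } else {
    val sum = value.getOrElse(0) + state.getOption().getOrElse(0)
    state.update(sum)
    (key, sum)
  }
}

val spec = StateSpec.function(trackingFunc _).timeout(Minutes(60))
val tracked = keyedStream.mapWithState(spec)  // keyedStream: DStream[(String, Int)]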

-Todd

On Wed, Nov 25, 2015 at 8:00 AM, diplomatic Guru 
wrote:

> Hello,
>
> I know how I could clear the old state depending on the input value. If
> some condition matches to determine that the state is old then set the
> return null, will invalidate the record. But this is only feasible if a new
> record arrives that matches the old key. What if no new data arrives for
> the old data, how could I make that invalid.
>
> e.g.
>
> A key/Value arrives like this
>
> Key 12-11-2015:10:00: Value:test,1,2,12-11-2015:10:00
>
> Above key will be updated to state.
>
> Every time there is a value for this '12-11-2015:10:00' key, it will be
> aggregated and updated. If the job is running for 24/7, then this state
> will be kept forever until we restart the job. But I could have a
> validation within the updateStateByKey function to check and delete the
> record if value[3]< SYSTIME-1. But this only effective if a new record
> arrives that matches the 12-11-2015:10:00 in the later days. What if no new
> values are received for this key:12-11-2015:10:00. I assume it will remain
> in the state, am I correct? if so the how do I clear the state?
>
> Thank you.
>
>
>


Re: Spark Driver Port Details

2015-11-25 Thread Todd Nist
The default is to start applications on port 4040 and then increment by 1
for each additional application, as you are seeing; see the docs here:
http://spark.apache.org/docs/latest/monitoring.html#web-interfaces

You can override this behavior by passing --conf spark.ui.port=4080 or by
setting it in your code; something like this:

val conf = new SparkConf().setAppName(s"YourApp").set("spark.ui.port", "4080")
val sc = new SparkContext(conf)

While there is a REST API that returns information on the applications,
http://yourserver:8080/api/v1/applications, it does not return the port
used by the application.

-Todd

On Wed, Nov 25, 2015 at 9:15 AM, aman solanki 
wrote:

>
> Hi,
>
> Can anyone tell me how i can get the details that a particular spark
> application is running on which particular port?
>
> For Example:
>
> I have two applications A and B
>
> A is running on 4040
> B is running on 4041
>
> How can i get these application port mapping? Is there a rest call or
> environment variable for the same?
>
> Please share your findings for standalone mode.
>
> Thanks,
> Aman Solanki
>


Re: Getting the batch time of the active batches in spark streaming

2015-11-24 Thread Todd Nist
Hi Abhi,

You should be able to register an
org.apache.spark.streaming.scheduler.StreamingListener.

There is an example here that may help:
https://gist.github.com/akhld/b10dc491aad1a2007183 and the spark api docs
here,
http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SparkListener.html
.

HTH,
-Todd

On Tue, Nov 24, 2015 at 4:50 PM, Abhishek Anand 
wrote:

> Hi ,
>
> I need to get the batch time of the active batches which appears on the UI
> of spark streaming tab,
>
> How can this be achieved in Java ?
>
> BR,
> Abhi
>


Re: Getting the batch time of the active batches in spark streaming

2015-11-24 Thread Todd Nist
Hi Abhi,

Sorry, that was the wrong link; it should have been the StreamingListener:
http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/scheduler/StreamingListener.html

The BatchInfo can be obtained from the event, for example:

public void onBatchSubmitted(StreamingListenerBatchSubmitted batchSubmitted) {
  System.out.println("Start time: " + batchSubmitted.batchInfo().processingStartTime());
}
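
For completeness, registering the listener looks roughly like this (shown in
Scala; JavaStreamingContext exposes the same addStreamingListener method, and
the listener class name below is made up):

import org.apache.spark.streaming.scheduler._

class BatchTimeListener extends StreamingListener {
  override def onBatchSubmitted(batchSubmitted: StreamingListenerBatchSubmitted): Unit = {
    println("Batch time: " + batchSubmitted.batchInfo.batchTime)
  }
}

ssc.addStreamingListener(new BatchTimeListener)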

Sorry for the confusion.

-Todd

On Tue, Nov 24, 2015 at 7:51 PM, Todd Nist <tsind...@gmail.com> wrote:

> Hi Abhi,
>
> You should be able to register an
> org.apache.spark.streaming.scheduler.StreamingListener.
>
> There is an example here that may help:
> https://gist.github.com/akhld/b10dc491aad1a2007183 and the spark api docs
> here,
> http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SparkListener.html
> .
>
> HTH,
> -Todd
>
> On Tue, Nov 24, 2015 at 4:50 PM, Abhishek Anand <abhis.anan...@gmail.com>
> wrote:
>
>> Hi ,
>>
>> I need to get the batch time of the active batches which appears on the
>> UI of spark streaming tab,
>>
>> How can this be achieved in Java ?
>>
>> BR,
>> Abhi
>>
>
>


Re: Maven build failed (Spark master)

2015-10-27 Thread Todd Nist
I issued the same basic command and it worked fine.

RADTech-MBP:spark $ ./make-distribution.sh --name hadoop-2.6 --tgz -Pyarn
-Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests

Which created: spark-1.6.0-SNAPSHOT-bin-hadoop-2.6.tgz in the root
directory of the project.

FWIW, the environment was an MBP with OS X 10.10.5 and Java:

java version "1.8.0_51"
Java(TM) SE Runtime Environment (build 1.8.0_51-b16)
Java HotSpot(TM) 64-Bit Server VM (build 25.51-b03, mixed mode)

-Todd

On Tue, Oct 27, 2015 at 12:17 PM, Ted Yu  wrote:

> I used the following command:
> make-distribution.sh --name custom-spark --tgz -Phadoop-2.4 -Phive
> -Phive-thriftserver -Pyarn
>
> spark-1.6.0-SNAPSHOT-bin-custom-spark.tgz was generated (with patch from
> SPARK-11348)
>
> Can you try above command ?
>
> Thanks
>
> On Tue, Oct 27, 2015 at 7:03 AM, Kayode Odeyemi  wrote:
>
>> Ted, I switched to this:
>>
>> ./make-distribution.sh --name spark-latest --tgz -Dhadoop.version=2.6.0
>> -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn -DskipTests clean package -U
>>
>> Same error. No .gz file. Here's the bottom output log:
>>
>> + rm -rf /home/emperor/javaprojects/spark/dist
>> + mkdir -p /home/emperor/javaprojects/spark/dist/lib
>> + echo 'Spark [WARNING] See
>> http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin (git revision
>> 3689beb) built for Hadoop [WARNING] See
>> http://docs.codehaus.org/display/MAVENUSER/Shade+Pl
>> + echo 'Build flags: -Dhadoop.version=2.6.0' -Phadoop-2.6 -Phive
>> -Phive-thriftserver -Pyarn -DskipTests clean package -U
>> + cp
>> /home/emperor/javaprojects/spark/assembly/target/scala-2.10/spark-assembly-1.6.0-SNAPSHOT-hadoop2.6.0.jar
>> /home/emperor/javaprojects/spark/dist/lib/
>> + cp
>> /home/emperor/javaprojects/spark/examples/target/scala-2.10/spark-examples-1.6.0-SNAPSHOT-hadoop2.6.0.jar
>> /home/emperor/javaprojects/spark/dist/lib/
>> + cp
>> /home/emperor/javaprojects/spark/network/yarn/target/scala-2.10/spark-1.6.0-SNAPSHOT-yarn-shuffle.jar
>> /home/emperor/javaprojects/spark/dist/lib/
>> + mkdir -p /home/emperor/javaprojects/spark/dist/examples/src/main
>> + cp -r /home/emperor/javaprojects/spark/examples/src/main
>> /home/emperor/javaprojects/spark/dist/examples/src/
>> + '[' 1 == 1 ']'
>> + cp
>> /home/emperor/javaprojects/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar
>> /home/emperor/javaprojects/spark/lib_managed/jars/datanucleus-core-3.2.10.jar
>> /home/emperor/javaprojects
>> ed/jars/datanucleus-rdbms-3.2.9.jar
>> /home/emperor/javaprojects/spark/dist/lib/
>> + cp /home/emperor/javaprojects/spark/LICENSE
>> /home/emperor/javaprojects/spark/dist
>> + cp -r /home/emperor/javaprojects/spark/licenses
>> /home/emperor/javaprojects/spark/dist
>> + cp /home/emperor/javaprojects/spark/NOTICE
>> /home/emperor/javaprojects/spark/dist
>> + '[' -e /home/emperor/javaprojects/spark/CHANGES.txt ']'
>> + cp -r /home/emperor/javaprojects/spark/data
>> /home/emperor/javaprojects/spark/dist
>> + mkdir /home/emperor/javaprojects/spark/dist/conf
>> + cp /home/emperor/javaprojects/spark/conf/docker.properties.template
>> /home/emperor/javaprojects/spark/conf/fairscheduler.xml.template
>> /home/emperor/javaprojects/spark/conf/log4j.properties
>> emperor/javaprojects/spark/conf/metrics.properties.template
>> /home/emperor/javaprojects/spark/conf/slaves.template
>> /home/emperor/javaprojects/spark/conf/spark-defaults.conf.template /home/em
>> ts/spark/conf/spark-env.sh.template
>> /home/emperor/javaprojects/spark/dist/conf
>> + cp /home/emperor/javaprojects/spark/README.md
>> /home/emperor/javaprojects/spark/dist
>> + cp -r /home/emperor/javaprojects/spark/bin
>> /home/emperor/javaprojects/spark/dist
>> + cp -r /home/emperor/javaprojects/spark/python
>> /home/emperor/javaprojects/spark/dist
>> + cp -r /home/emperor/javaprojects/spark/sbin
>> /home/emperor/javaprojects/spark/dist
>> + cp -r /home/emperor/javaprojects/spark/ec2
>> /home/emperor/javaprojects/spark/dist
>> + '[' -d /home/emperor/javaprojects/spark/R/lib/SparkR ']'
>> + '[' false == true ']'
>> + '[' true == true ']'
>> + TARDIR_NAME='spark-[WARNING] See
>> http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin-bin-spark-latest'
>> + TARDIR='/home/emperor/javaprojects/spark/spark-[WARNING] See
>> http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin-bin-spark-latest'
>> + rm -rf '/home/emperor/javaprojects/spark/spark-[WARNING] See
>> http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin-bin-spark-latest'
>> + cp -r /home/emperor/javaprojects/spark/dist
>> '/home/emperor/javaprojects/spark/spark-[WARNING] See
>> http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin-bin-spark-latest'
>> cp: cannot create directory
>> `/home/emperor/javaprojects/spark/spark-[WARNING] See
>> http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin-bin-spark-latest':
>> No such file or directory
>>
>>
>> On Tue, Oct 27, 2015 at 2:14 PM, Ted Yu  wrote:
>>
>>> Can you try the 

Re: Newbie Help for spark compilation problem

2015-10-25 Thread Todd Nist
So yes, the individual artifacts are released; however, there is no
deployable bundle prebuilt for Spark 1.5.1 and Scala 2.11.7, something
like spark-1.5.1-bin-hadoop-2.6_scala-2.11.tgz.  The Spark site even
states this:

*Note: Scala 2.11 users should download the Spark source package and
build with Scala 2.11 support
<http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211>.*

So if you want one simple deployable for a standalone environment, I
thought you had to perform the make-distribution step like I described.

Clearly the individual artifacts are there as you state, but is there a
provided 2.11 tgz available as well?  I did not think there was; if there
is, then should the documentation on the download site be changed to
reflect this?

Sorry for the confusion.

-Todd

On Sun, Oct 25, 2015 at 4:07 PM, Sean Owen <so...@cloudera.com> wrote:

> No, 2.11 artifacts are in fact published:
> http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22spark-parent_2.11%22
>
> On Sun, Oct 25, 2015 at 7:37 PM, Todd Nist <tsind...@gmail.com> wrote:
> > Sorry Sean, you are absolutely right that it supports 2.11; all I meant is
> there is
> > no release available as a standard download and that one has to build it.
> > Thanks for the clarification.
> > -Todd
> >
>


Re: Newbie Help for spark compilation problem

2015-10-25 Thread Todd Nist
Hi Bilinmek,

Spark 1.5.x does not support Scala 2.11.7 out of the box, so the easiest thing
to do is build it like you're trying.  Here are the steps I followed to build
it on a Mac OS X 10.10.5 environment; it should be very similar on Ubuntu.

1.  Set the JAVA_HOME environment variable in my bash session via export
JAVA_HOME=$(/usr/libexec/java_home).
2.  Spark is easiest to build with Maven, so ensure Maven is installed; I
installed 3.3.x.
3.  Download the source from Spark's site and extract it.
4.  Change into the spark-1.5.1 folder and run:
   ./dev/change-scala-version.sh 2.11
5.  Issue the following command to build and create a distribution;

./make-distribution.sh --name hadoop-2.6_scala-2.11 --tgz -Pyarn
-Phadoop-2.6 -Dhadoop.version=2.6.0 -Dscala-2.11 -DskipTests

This will provide you with a fully self-contained installation of Spark
for Scala 2.11, including scripts and the like.  There are some limitations;
see this,
http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211,
for what is not supported.

HTH,

-Todd

On Sun, Oct 25, 2015 at 10:56 AM, Bilinmek Istemiyor 
wrote:

>
> I am just starting out apache spark. I hava zero knowledge about the spark
> environment, scala and sbt. I have a built problems which I could not
> solve. Any help much appreciated.
>
> I am using kubuntu 14.04, java "1.7.0_80, scala 2.11.7 and spark 1.5.1
>
> I tried to compile spark from source an and receive following errors
>
> [0m[[31merror[0m] [0mimpossible to get artifacts when data has not been
> loaded. IvyNode = org.scala-lang#scala-library;2.10.3[0m
> [0m[[31merror[0m] [0m(hive/*:[31mupdate[0m)
> java.lang.IllegalStateException: impossible to get artifacts when data has
> not been loaded. IvyNode = org.scala-lang#scala-library;2.10.3[0m
> [0m[[31merror[0m] [0m(streaming-flume-sink/avro:[31mgenerate[0m)
> org.apache.avro.SchemaParseException: Undefined name: "strıng"[0m
> [0m[[31merror[0m] [0m(streaming-kafka-assembly/*:[31massembly[0m)
> java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF[0m
> [0m[[31merror[0m] [0m(streaming-mqtt/test:[31massembly[0m)
> java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF[0m
> [0m[[31merror[0m] [0m(assembly/*:[31massembly[0m)
> java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF[0m
> [0m[[31merror[0m] [0m(streaming-mqtt-assembly/*:[31massembly[0m)
> java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF[0m
> [0m[[31merror[0m] [0mTotal time: 1128 s, completed 25.Eki.2015 11:00:52[0m
>
> Sorry about some strange characters. I tried to capture the output with
>
> sbt clean assembly 2>&1 | tee compile.txt
>
> compile.txt was full of these characters.  I have attached the output of
> full compile process "compile.txt".
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>


Re: Newbie Help for spark compilation problem

2015-10-25 Thread Todd Nist
Sorry Sean, you are absolutely right that it supports 2.11; all I meant is there
is no release available as a standard download and that one has to build
it.  Thanks for the clarification.
-Todd

On Sunday, October 25, 2015, Sean Owen <so...@cloudera.com> wrote:

> Hm, why do you say it doesn't support 2.11? It does.
>
> It is not even this difficult; you just need a source distribution,
> and then run "./dev/change-scala-version.sh 2.11" as you say. Then
> build as normal
>
> On Sun, Oct 25, 2015 at 4:00 PM, Todd Nist <tsind...@gmail.com
> <javascript:;>> wrote:
> > Hi Bilnmek,
> >
> > Spark 1.5.x does not support Scala 2.11.7 so the easiest thing to do it
> > build it like your trying.  Here are the steps I followed to build it on
> a
> > Max OS X 10.10.5 environment, should be very similar on ubuntu.
> >
> > 1.  set theJAVA_HOME environment variable in my bash session via export
> > JAVA_HOME=$(/usr/libexec/java_home).
> > 2. Spark is easiest to build with Maven so insure maven is installed, I
> > installed 3.3.x.
> > 3.  Download the source form Spark's site and extract.
> > 4.  Change into the spark-1.5.1 folder and run:
> >./dev/change-scala-version.sh 2.11
> > 5.  Issue the following command to build and create a distribution;
> >
> > ./make-distribution.sh --name hadoop-2.6_scala-2.11 --tgz -Pyarn
> > -Phadoop-2.6 -Dhadoop.version=2.6.0 -Dscala-2.11 -DskipTests
> >
> > This will provide you with a a fully self-contained installation of Spark
> > for Scala 2.11 including scripts and the like.  There are some
> limitations
> > see this,
> >
> http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211
> ,
> > for what is not supported.
> >
> > HTH,
> >
> > -Todd
> >
> >
> > On Sun, Oct 25, 2015 at 10:56 AM, Bilinmek Istemiyor <
> benibi...@gmail.com <javascript:;>>
> > wrote:
> >>
> >>
> >> I am just starting out apache spark. I hava zero knowledge about the
> spark
> >> environment, scala and sbt. I have a built problems which I could not
> solve.
> >> Any help much appreciated.
> >>
> >> I am using kubuntu 14.04, java "1.7.0_80, scala 2.11.7 and spark 1.5.1
> >>
> >> I tried to compile spark from source an and receive following errors
> >>
> >> [0m[[31merror[0m] [0mimpossible to get artifacts when data has not been
> >> loaded. IvyNode = org.scala-lang#scala-library;2.10.3[0m
> >> [0m[[31merror[0m] [0m(hive/*:[31mupdate[0m)
> >> java.lang.IllegalStateException: impossible to get artifacts when data
> has
> >> not been loaded. IvyNode = org.scala-lang#scala-library;2.10.3[0m
> >> [0m[[31merror[0m] [0m(streaming-flume-sink/avro:[31mgenerate[0m)
> >> org.apache.avro.SchemaParseException: Undefined name: "strıng"[0m
> >> [0m[[31merror[0m] [0m(streaming-kafka-assembly/*:[31massembly[0m)
> >> java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF[0m
> >> [0m[[31merror[0m] [0m(streaming-mqtt/test:[31massembly[0m)
> >> java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF[0m
> >> [0m[[31merror[0m] [0m(assembly/*:[31massembly[0m)
> >> java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF[0m
> >> [0m[[31merror[0m] [0m(streaming-mqtt-assembly/*:[31massembly[0m)
> >> java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF[0m
> >> [0m[[31merror[0m] [0mTotal time: 1128 s, completed 25.Eki.2015
> 11:00:52[0m
> >>
> >> Sorry about some strange characters. I tried to capture the output with
> >>
> >> sbt clean assembly 2>&1 | tee compile.txt
> >>
> >> compile.txt was full of these characters.  I have attached the output of
> >> full compile process "compile.txt".
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> <javascript:;>
> >> For additional commands, e-mail: user-h...@spark.apache.org
> <javascript:;>
> >
> >
>


Re: java.lang.NegativeArraySizeException? as iterating a big RDD

2015-10-23 Thread Todd Nist
Hi Yifan,

You could also try increasing spark.kryoserializer.buffer.max.mb.

*spark.kryoserializer.buffer.max.mb* (64 MB by default): useful if your
serialized objects need a buffer larger than 64 MB.

Per doc:
Maximum allowable size of Kryo serialization buffer. This must be larger
than any object you attempt to serialize. Increase this if you get a
"buffer limit exceeded" exception inside Kryo.

-Todd

On Fri, Oct 23, 2015 at 6:51 AM, Yifan LI  wrote:

> Thanks for your advice, Jem. :)
>
> I will increase the partitioning and see if it helps.
>
> Best,
> Yifan LI
>
>
>
>
>
> On 23 Oct 2015, at 12:48, Jem Tucker  wrote:
>
> Hi Yifan,
>
> I think this is a result of Kryo trying to seriallize something too large.
> Have you tried to increase your partitioning?
>
> Cheers,
>
> Jem
>
> On Fri, Oct 23, 2015 at 11:24 AM Yifan LI  wrote:
>
>> Hi,
>>
>> I have a big sorted RDD sRdd(~962million elements), and need to scan its
>> elements in order(using sRdd.toLocalIterator).
>>
>> But the process failed when the scanning was done after around 893million
>> elements, returned with following exception:
>>
>> Anyone has idea? Thanks!
>>
>>
>> Exception in thread "main" org.apache.spark.SparkException: Job aborted
>> due to stage failure: Task 0 in stage 421752.0 failed 128 times, most
>> recent failure: Lost task 0.127 in stage 421752.0 (TID 17304,
>> small15-tap1.common.lip6.fr): java.lang.NegativeArraySizeException
>> at
>> com.esotericsoftware.kryo.util.IdentityObjectIntMap.resize(IdentityObjectIntMap.java:409)
>> at
>> com.esotericsoftware.kryo.util.IdentityObjectIntMap.putStash(IdentityObjectIntMap.java:227)
>> at
>> com.esotericsoftware.kryo.util.IdentityObjectIntMap.push(IdentityObjectIntMap.java:221)
>> at
>> com.esotericsoftware.kryo.util.IdentityObjectIntMap.put(IdentityObjectIntMap.java:117)
>> at
>> com.esotericsoftware.kryo.util.IdentityObjectIntMap.putStash(IdentityObjectIntMap.java:228)
>> at
>> com.esotericsoftware.kryo.util.IdentityObjectIntMap.push(IdentityObjectIntMap.java:221)
>> at
>> com.esotericsoftware.kryo.util.IdentityObjectIntMap.put(IdentityObjectIntMap.java:117)
>> at
>> com.esotericsoftware.kryo.util.MapReferenceResolver.addWrittenObject(MapReferenceResolver.java:23)
>> at com.esotericsoftware.kryo.Kryo.writeReferenceOrNull(Kryo.java:598)
>> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:566)
>> at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:36)
>> at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33)
>> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
>> at
>> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318)
>> at
>> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
>> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
>> at
>> org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:250)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:236)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Driver stacktrace:
>> at org.apache.spark.scheduler.DAGScheduler.org
>> 
>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
>> at
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at
>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>> at scala.Option.foreach(Option.scala:236)
>> at
>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
>> at
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
>> at
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>
>> Best,
>> Yifan LI
>>
>>
>>
>>
>>
>>
>


Re: Spark SQL Thriftserver and Hive UDF in Production

2015-10-19 Thread Todd Nist
From Tableau, you should be able to use the Initial SQL option to support
this:

So in Tableau add the following to the “Initial SQL”

create function myfunc AS 'myclass'
using jar 'hdfs:///path/to/jar';



HTH,
Todd


On Mon, Oct 19, 2015 at 11:22 AM, Deenar Toraskar  wrote:

> Reece
>
> You can do the following. Start the spark-shell. Register the UDFs in the
> shell using sqlContext, then start the Thrift Server using startWithContext
> from the spark shell:
> https://github.com/apache/spark/blob/master/sql/hive-thriftserver
> /src/main/scala/org/apache/spark/sql/hive/thriftserver
> /HiveThriftServer2.scala#L56
>
>
>
> Regards
> Deenar
>
> On 19 October 2015 at 04:42, Mohammed Guller 
> wrote:
>
>> Have you tried registering the function using the Beeline client?
>>
>> Another alternative would be to create a Spark SQL UDF and launch the
>> Spark SQL Thrift server programmatically.
>>
>> Mohammed
>>
>> -Original Message-
>> From: ReeceRobinson [mailto:re...@therobinsons.gen.nz]
>> Sent: Sunday, October 18, 2015 8:05 PM
>> To: user@spark.apache.org
>> Subject: Spark SQL Thriftserver and Hive UDF in Production
>>
>> Does anyone have some advice on the best way to deploy a Hive UDF for use
>> with a Spark SQL Thriftserver where the client is Tableau using Simba ODBC
>> Spark SQL driver.
>>
>> I have seen the hive documentation that provides an example of creating
>> the function using a hive client ie: CREATE FUNCTION myfunc AS 'myclass'
>> USING JAR 'hdfs:///path/to/jar';
>>
>> However using Tableau I can't run this create function statement to
>> register my UDF. Ideally there is a configuration setting that will load my
>> UDF jar and register it at start-up of the thriftserver.
>>
>> Can anyone tell me what the best option if it is possible?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Thriftserver-and-Hive-UDF-in-Production-tp25114.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
>> commands, e-mail: user-h...@spark.apache.org
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: KafkaProducer using Cassandra as source

2015-09-23 Thread Todd Nist
Hi Kali,

If you do not mind sending JSON, you could do something like this, using
json4s:


// a Formats instance is needed implicitly by write/writePretty
implicit val formats = Serialization.formats(NoTypeHints)

val rows = p.collect() map ( row => TestTable(row.getString(0), row.getString(1)) )

val json = parse(write(rows))

producer.send(new KeyedMessage[String, String]("trade", writePretty(json)))

// or for each individual entry
for (row <- rows) {
  producer.send(new KeyedMessage[String, String]("trade", writePretty(parse(write(row)))))
}

Just make sure you import the following:

import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.json4s.native.Serialization
import org.json4s.native.Serialization.{ read, write, writePretty }


On Wed, Sep 23, 2015 at 12:26 PM, kali.tumm...@gmail.com <
kali.tumm...@gmail.com> wrote:

> Guys sorry I figured it out.
>
> val
>
> x=p.collect().mkString("\n").replace("[","").replace("]","").replace(",","~")
>
> Full Code:-
>
> package com.examples
>
> /**
>  * Created by kalit_000 on 22/09/2015.
>  */
>
> import kafka.producer.KeyedMessage
> import kafka.producer.Producer
> import kafka.producer.ProducerConfig
> import java.util.Properties
> import _root_.kafka.serializer.StringDecoder
> import org.apache.spark._
> import org.apache.spark.SparkContext._
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.SparkConf
> import org.apache.log4j.Logger
> import org.apache.log4j.Level
> import org.apache.spark.streaming._
> import org.apache.spark.streaming.{Seconds,StreamingContext}
> import org.apache.spark._
> import org.apache.spark.streaming.StreamingContext._
> import org.apache.spark.streaming.kafka.KafkaUtils
>
> object SparkProducerDBCassandra {
>
>   case class TestTable (TRADE_ID:String,TRADE_PRICE: String)
>
>   def main(args: Array[String]): Unit =
>   {
> Logger.getLogger("org").setLevel(Level.WARN)
> Logger.getLogger("akka").setLevel(Level.WARN)
>
> val conf = new
>
> SparkConf().setMaster("local[2]").setAppName("testkali2").set("spark.cassandra.connection.host",
> "127.0.0.1")
> val sc=new SparkContext("local","test",conf)
> //val ssc= new StreamingContext(sc,Seconds(2))
>
> print("Test kali Spark Cassandra")
>
> val cc = new org.apache.spark.sql.cassandra.CassandraSQLContext(sc)
>
> val p=cc.sql("select * from people.person")
>
> p.collect().foreach(println)
>
> val props:Properties = new Properties()
> props.put("metadata.broker.list", "localhost:9092")
> props.put("serializer.class", "kafka.serializer.StringEncoder")
>
> val config= new ProducerConfig(props)
> val producer= new Producer[String,String](config)
>
> val
>
> x=p.collect().mkString("\n").replace("[","").replace("]","").replace(",","~")
>
>producer.send(new KeyedMessage[String, String]("trade", x))
>
> //p.collect().foreach(print)
>
> //ssc.start()
>
> //ssc.awaitTermination()
>
>   }
> }
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/KafkaProducer-using-Cassandra-as-source-tp24774p24788.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Replacing Esper with Spark Streaming?

2015-09-14 Thread Todd Nist
Stratio offers a CEP implementation based on Spark Streaming and the Siddhi
CEP engine.  I have not used the links below, but they may be of some value
to you:

http://stratio.github.io/streaming-cep-engine/

https://github.com/Stratio/streaming-cep-engine

HTH.

-Todd

On Sun, Sep 13, 2015 at 7:49 PM, Otis Gospodnetić <
otis.gospodne...@gmail.com> wrote:

> Hi,
>
> I'm wondering if anyone has attempted to replace Esper with Spark
> Streaming or if anyone thinks Spark Streaming is/isn't a good tool for the
> (CEP) job?
>
> We are considering Akka or Spark Streaming as possible Esper replacements
> and would appreciate any input from people who tried to do that with either
> of them.
>
> Thanks,
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>


Re: Tungsten and Spark Streaming

2015-09-10 Thread Todd Nist
https://issues.apache.org/jira/browse/SPARK-8360?jql=project%20%3D%20SPARK%20AND%20text%20~%20Streaming

-Todd

On Thu, Sep 10, 2015 at 10:22 AM, Gurvinder Singh <
gurvinder.si...@uninett.no> wrote:

> On 09/10/2015 07:42 AM, Tathagata Das wrote:
> > Rewriting is necessary. You will have to convert RDD/DStream operations
> > to DataFrame operations. So get the RDDs in DStream, using
> > transform/foreachRDD, convert to DataFrames and then do DataFrame
> > operations.
>
> Are there any plans for 1.6 or later to add support of tungsten to
> RDD/DStream directly or it is intended that users should switch to
> dataframe rather then operating on RDD/Dstream level.
>
> >
> > On Wed, Sep 9, 2015 at 9:23 PM, N B  > > wrote:
> >
> > Hello,
> >
> > How can we start taking advantage of the performance gains made
> > under Project Tungsten in Spark 1.5 for a Spark Streaming program?
> >
> > From what I understand, this is available by default for Dataframes.
> > But for a program written using Spark Streaming, would we see any
> > potential gains "out of the box" in 1.5 or will we have to rewrite
> > some portions of the application code to realize that benefit?
> >
> > Any insight/documentation links etc in this regard will be
> appreciated.
> >
> > Thanks
> > Nikunj
> >
> >
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Starting Spark SQL thrift server from within a streaming app

2015-08-06 Thread Todd Nist
Well, the point of creating a thrift server would be to allow external access to
the data from JDBC / ODBC type connections.  The sparkstreaming-sql project
leverages a standard Spark SQL context and then provides a means of
converting an incoming DStream into a Row; look at the MessageToRow trait
in the KafkaSource class.

The example org.apache.spark.sql.streaming.examples.KafkaDDL should make
it clear, I think.

-Todd

On Thu, Aug 6, 2015 at 7:58 AM, Daniel Haviv 
daniel.ha...@veracity-group.com wrote:

 Thank you Todd,
 How is the sparkstreaming-sql project different from starting a thrift
 server on a streaming app ?

 Thanks again.
 Daniel


 On Thu, Aug 6, 2015 at 1:53 AM, Todd Nist tsind...@gmail.com wrote:

 Hi Danniel,

 It is possible to create an instance of the SparkSQL Thrift server;
 however, it seems like this project may be what you are looking for:

 https://github.com/Intel-bigdata/spark-streamingsql

 Not 100% sure what your use case is, but you can always convert the data
 into a DF and then issue a query against it.  If you want other systems to
 be able to query it, then there are numerous connectors to store data into
 Hive, Cassandra, HBase, ElasticSearch, etc.

 To create a instance of a thrift server with its own SQL Context you
 would do something like the following:

 import org.apache.spark.{SparkConf, SparkContext}

 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.hive.HiveMetastoreTypes._
 import org.apache.spark.sql.types._
 import org.apache.spark.sql.hive.thriftserver._


 object MyThriftServer {

   val sparkConf = new SparkConf()
 // master is passed to spark-submit, but could also be specified explicitly
 // .setMaster(sparkMaster)
 .setAppName("My ThriftServer")
 .set("spark.cores.max", "2")
   val sc = new SparkContext(sparkConf)
   val sparkContext = sc
   import sparkContext._
   val sqlContext = new HiveContext(sparkContext)
   import sqlContext._
   import sqlContext.implicits._

   makeRDD((1, "hello") :: (2, "world") :: Nil).toDF.cache().registerTempTable("t")

   HiveThriftServer2.startWithContext(sqlContext)
 }

 Again, I'm not really clear what your use case is, but it does sound like
 the first link above is what you may want.

 -Todd

 On Wed, Aug 5, 2015 at 1:57 PM, Daniel Haviv 
 daniel.ha...@veracity-group.com wrote:

 Hi,
 Is it possible to start the Spark SQL thrift server from with a
 streaming app so the streamed data could be queried as it's goes in ?

 Thank you.
 Daniel






Re: How can I know currently supported functions in Spark SQL

2015-08-06 Thread Todd Nist
They are covered here in the docs:

http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.sql.functions$
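
For example, a couple of the built-in functions beyond sum/count/min/max
(the DataFrame and column names here are made up):

import org.apache.spark.sql.functions._

df.agg(avg("price"), countDistinct("userId")).show()
df.select(upper(col("name")), abs(col("delta"))).show()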


On Thu, Aug 6, 2015 at 5:52 AM, Netwaver wanglong_...@163.com wrote:

 Hi All,
  I am using Spark 1.4.1, and I want to know how can I find the
 complete function list supported in Spark SQL, currently I only know
 'sum','count','min','max'. Thanks a lot.





Re: Starting Spark SQL thrift server from within a streaming app

2015-08-05 Thread Todd Nist
Hi Daniel,

It is possible to create an instance of the SparkSQL Thrift server; however,
it seems like this project may be what you are looking for:

https://github.com/Intel-bigdata/spark-streamingsql

Not 100% sure what your use case is, but you can always convert the data into
a DF and then issue a query against it.  If you want other systems to be able
to query it, then there are numerous connectors to store data into Hive,
Cassandra, HBase, ElasticSearch, etc.

To create an instance of a thrift server with its own SQL Context you would
do something like the following:

import org.apache.spark.{SparkConf, SparkContext}

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveMetastoreTypes._
import org.apache.spark.sql.types._
import org.apache.spark.sql.hive.thriftserver._


object MyThriftServer {

  val sparkConf = new SparkConf()
// master is passed to spark-submit, but could also be specified explicitly
// .setMaster(sparkMaster)
.setAppName("My ThriftServer")
.set("spark.cores.max", "2")
  val sc = new SparkContext(sparkConf)
  val sparkContext = sc
  import sparkContext._
  val sqlContext = new HiveContext(sparkContext)
  import sqlContext._
  import sqlContext.implicits._

  makeRDD((1, "hello") :: (2, "world") :: Nil).toDF.cache().registerTempTable("t")

  HiveThriftServer2.startWithContext(sqlContext)
}

Again, I'm not really clear what your use case is, but it does sound like
the first link above is what you may want.

-Todd

On Wed, Aug 5, 2015 at 1:57 PM, Daniel Haviv 
daniel.ha...@veracity-group.com wrote:

 Hi,
 Is it possible to start the Spark SQL thrift server from with a streaming
 app so the streamed data could be queried as it's goes in ?

 Thank you.
 Daniel



Re: Does Spark streaming support is there with RabbitMQ

2015-07-20 Thread Todd Nist
There is one package available on the spark-packages site,

http://spark-packages.org/package/Stratio/RabbitMQ-Receiver

The source is here:

https://github.com/Stratio/RabbitMQ-Receiver

Not sure whether that meets your needs or not.

-Todd

On Mon, Jul 20, 2015 at 8:52 AM, Jeetendra Gangele gangele...@gmail.com
wrote:

 Does Apache spark support RabbitMQ. I have messages on RabbitMQ and I want
 to process them using Apache Spark streaming does it scale?

 Regards
 Jeetendra



Re: Use rank with distribute by in HiveContext

2015-07-16 Thread Todd Nist
Did you take a look at the excellent write up by Yin Huai and Michael
Armbrust?  It appears that rank is supported in the 1.4.x release.

https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

Snippet from above article for your convenience:

To answer the first question “*What are the best-selling and the second
best-selling products in every category?*”, we need to rank products in a
category based on their revenue, and to pick the best selling and the
second best-selling products based the ranking. Below is the SQL query used
to answer this question by using window function dense_rank (we will
explain the syntax of using window functions in next section).

SELECT
  product,
  category,
  revenue
FROM (
  SELECT
    product,
    category,
    revenue,
    dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank
  FROM productRevenue) tmp
WHERE
  rank <= 2



The result of this query is shown below. Without using window functions, it
is very hard to express the query in SQL, and even if a SQL query can be
expressed, it is hard for the underlying engine to efficiently evaluate the
query.

[image: 1-2]


Ranking functions (SQL / DataFrame API):
  rank / rank
  dense_rank / denseRank
  percent_rank / percentRank
  ntile / ntile
  row_number / rowNumber
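
For reference, the same kind of query can be expressed with the DataFrame API
in 1.4.x; a minimal sketch, assuming the productRevenue DataFrame from the
article:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, denseRank}

val w = Window.partitionBy("category").orderBy(col("revenue").desc)

val top2 = productRevenue
  .select(col("product"), col("category"), col("revenue"),
          denseRank().over(w).alias("rank"))
  .where(col("rank") <= 2)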

 HTH.

-Todd

On Thu, Jul 16, 2015 at 8:10 AM, Lior Chaga lio...@taboola.com wrote:

 Does spark HiveContext support the rank() ... distribute by syntax (as in
 the following article-
 http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/doing_rank_with_hive
 )?

 If not, how can it be achieved?

 Thanks,
 Lior



Re: spark streaming job to hbase write

2015-07-15 Thread Todd Nist
There are three connector packages listed on the spark-packages web site:

http://spark-packages.org/?q=hbase
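
If you do go with plain Put operations rather than one of those connectors, a
rough sketch of writing each batch might look like the following (the table
name, column family, and the (rowKey, value) record shape are assumptions, and
this presumes the standard HBase 1.x client API is on the classpath):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // the connection is not serializable, so create it on the executor,
    // once per partition
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("my_table"))
    partition.foreach { case (rowKey: String, value: String) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      table.put(put)
    }
    table.close()
    conn.close()
  }
}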

HTH.

-Todd

On Wed, Jul 15, 2015 at 2:46 PM, Shushant Arora shushantaror...@gmail.com
wrote:

 Hi

 I have a requirement of writing in hbase table from Spark streaming app
 after some processing.
 Is Hbase put operation the only way of writing to hbase or is there any
 specialised connector or rdd of spark for hbase write.

 Should Bulk load to hbase from streaming  app be avoided if output of each
 batch interval is just few mbs?

 Thanks




Re: Saving RDD into cassandra keyspace.

2015-07-10 Thread Todd Nist
I would strongly encourage you to read the docs; they are very useful in
getting up and running:

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/0_quick_start.md

For your use case shown above, you will need to ensure that you include the
appropriate version of the spark-cassandra-connector assembly jar when you
submit the job.  The version you use should correspond to the version of
Spark you are running.   In addition, you will want to ensure that you set
the spark.cassandra.connection.host as shown below, prior to creating the
SparkContext.

val conf = new SparkConf(true)
   .set("spark.cassandra.connection.host", "127.0.0.1")
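
Putting it together with your snippet, a minimal sketch might look like this
(the keyspace, table, and column names come from your code; the host is just
an example):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val conf = new SparkConf(true)
  .setAppName("Spark Count")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

sc.textFile(args(0))
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .saveToCassandra("sparkdata", "words", SomeColumns("word", "count"))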


HTH

-Todd


On Fri, Jul 10, 2015 at 5:24 AM, Prateek . prat...@aricent.com wrote:

  Hi,



 I am beginner to spark , I want save the word and its count to cassandra
 keyspace, I wrote the following code



 import org.apache.spark.SparkContext

 import org.apache.spark.SparkContext._

 import org.apache.spark.SparkConf

 import com.datastax.spark.connector._



 object SparkWordCount {

   def main(args: Array[String]) {

 val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))

 val tokenized = sc.textFile(args(0)).flatMap(_.split(" "))

 val wordCounts = tokenized.map((_, 1)).reduceByKey(_ + _)

 wordCounts.saveToCassandra("sparkdata", "words", SomeColumns("word",
 "count"));



   }

 and did spark-submit. The code doesn’t work ( may be some very basic error
 because I am new to it).I know there is datastax cassandra connector but
 how to make connection?

 What all things I am missing in my code?



 Thanks













Re: [X-post] Saving SparkSQL result RDD to Cassandra

2015-07-09 Thread Todd Nist
foreachRDD returns a Unit:

def foreachRDD(foreachFunc: (RDD[T]) ⇒ Unit): Unit
(RDD: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html)

Apply a function to each RDD in this DStream. This is an output operator,
so 'this' DStream will be registered as an output stream and therefore
materialized.

Change it to a map, foreach or some other form of transform.
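
For example, one way to restructure it is to do the conversion and save inside
foreachRDD; a rough sketch (the keyspace, table, and data case class come from
the snippet quoted below, and this assumes the spark-cassandra-connector is on
the classpath and that the query produces RDDs of Rows):

import com.datastax.spark.connector._

streamSqlContext.sql("...").foreachRDD { rdd =>
  rdd.map(row => data(row.get(0).toString, row.get(1).toString))
     .saveToCassandra("demo", "sqltest")
}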

HTH

-Todd


On Thu, Jul 9, 2015 at 5:24 PM, Su She suhsheka...@gmail.com wrote:

 Hello All,

 I also posted this on the Spark/Datastax thread, but thought it was also
 50% a spark question (or mostly a spark question).

 I was wondering what is the best practice to saving streaming Spark SQL (
 https://github.com/Intel-bigdata/spark-streamingsql/blob/master/src/main/scala/org/apache/spark/sql/streaming/examples/KafkaDDL.scala)
 results to Cassandra?

 The query looks like this:

  streamSqlContext.sql(
   """
 |SELECT t.word, COUNT(t.word)
 |FROM (SELECT * FROM t_kafka) OVER (WINDOW '9' SECONDS, SLIDE '3'
 SECONDS) AS t
 |GROUP BY t.word
   """.stripMargin)
   .foreachRDD { r => r.toString()}.map(x =>
 x.split(",")).map(x => data(x(0), x(1))).saveToCassandra("demo", "sqltest")

 I’m getting a message saying map isn’t a member of Unit.

 I thought since I'm converting it to a string I can call a map/save to
 Cassandra function there, but it seems like I can't call map after
 r.toString()?

 Please let me know if this is possible and what is the best way of doing
 this. Thank you for the help!

 -Su



Re: Setting JVM heap start and max sizes, -Xms and -Xmx, for executors

2015-07-02 Thread Todd Nist
Yes, that does appear to be the case.  The documentation is very clear
about the heap settings and that they can not be used with
spark.executor.extraJavaOptions

spark.executor.extraJavaOptions (default: none): A string of extra JVM options
to pass to executors. For instance, GC settings or other logging. *Note that it
is illegal to set Spark properties or heap size settings with this option.*
Spark properties should be set using a SparkConf object or the
spark-defaults.conf file used with the spark-submit script. *Heap size
settings can be set with spark.executor.memory*.

So it appears to be a limitation at this time.

-Todd



On Thu, Jul 2, 2015 at 4:13 PM, Mulugeta Mammo mulugeta.abe...@gmail.com
wrote:

 thanks but my use case requires I specify different start and max heap
 sizes. Looks like spark sets start and max sizes  same value.

 On Thu, Jul 2, 2015 at 1:08 PM, Todd Nist tsind...@gmail.com wrote:

 You should use:

 spark.executor.memory

 from the docs https://spark.apache.org/docs/latest/configuration.html:
 spark.executor.memory (default: 512m): Amount of memory to use per executor
 process, in the same format as JVM memory strings (e.g. 512m, 2g).

 -Todd



 On Thu, Jul 2, 2015 at 3:36 PM, Mulugeta Mammo mulugeta.abe...@gmail.com
  wrote:

 tried that one and it throws error - extraJavaOptions is not allowed to
 alter memory settings, use spakr.executor.memory instead.

 On Thu, Jul 2, 2015 at 12:21 PM, Benjamin Fradet 
 benjamin.fra...@gmail.com wrote:

 Hi,

 You can set those parameters through the

 spark.executor.extraJavaOptions

 Which is documented in the configuration guide:
 spark.apache.org/docs/latest/configuration.htnl
 On 2 Jul 2015 9:06 pm, Mulugeta Mammo mulugeta.abe...@gmail.com
 wrote:

 Hi,

 I'm running Spark 1.4.0, I want to specify the start and max size
 (-Xms and Xmx) of the jvm heap size for my executors, I tried:

 executor.cores.memory=-Xms1g -Xms8g

 but doesn't work. How do I specify?

 Appreciate your help.

 Thanks,







Re: Setting JVM heap start and max sizes, -Xms and -Xmx, for executors

2015-07-02 Thread Todd Nist
You should use:

spark.executor.memory

from the docs https://spark.apache.org/docs/latest/configuration.html:
spark.executor.memory (default: 512m): Amount of memory to use per executor
process, in the same format as JVM memory strings (e.g. 512m, 2g).
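
For example, a minimal sketch of setting it programmatically (as far as I know,
Spark then launches each executor JVM with both -Xms and -Xmx set to this
single value):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.executor.memory", "8g")
val sc = new SparkContext(conf)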

-Todd



On Thu, Jul 2, 2015 at 3:36 PM, Mulugeta Mammo mulugeta.abe...@gmail.com
wrote:

 tried that one and it throws error - extraJavaOptions is not allowed to
 alter memory settings, use spakr.executor.memory instead.

 On Thu, Jul 2, 2015 at 12:21 PM, Benjamin Fradet 
 benjamin.fra...@gmail.com wrote:

 Hi,

 You can set those parameters through the

 spark.executor.extraJavaOptions

 Which is documented in the configuration guide:
 spark.apache.org/docs/latest/configuration.htnl
 On 2 Jul 2015 9:06 pm, Mulugeta Mammo mulugeta.abe...@gmail.com
 wrote:

 Hi,

 I'm running Spark 1.4.0, I want to specify the start and max size (-Xms
 and Xmx) of the jvm heap size for my executors, I tried:

 executor.cores.memory=-Xms1g -Xms8g

 but doesn't work. How do I specify?

 Appreciate your help.

 Thanks,





Re: Spark 1.4 on HortonWork HDP 2.2

2015-06-19 Thread Todd Nist
You can get HDP with at least Spark 1.3.1 from Hortonworks:

http://hortonworks.com/hadoop-tutorial/using-apache-spark-technical-preview-with-hdp-2-2/

For your convenience, from the docs:

wget -nv 
http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.2.4.4/hdp.repo
-O /etc/yum.repos.d/HDP-TP.repo


and then install:

yum install spark_2_2_4_4_16-master






On Fri, Jun 19, 2015 at 12:01 PM, Doug Balog doug.sparku...@dugos.com
wrote:

 If you run Hadoop in secure mode and want to talk to Hive 0.14, it won’t
 work, see SPARK-5111
 I have a patched version of 1.3.1 that I’ve been using.
 I haven’t had the time to get 1.4.0 working.

 Cheers,

 Doug



  On Jun 19, 2015, at 8:39 AM, ayan guha guha.a...@gmail.com wrote:
 
  I think you can get spark 1.4 pre built with hadoop 2.6 (as that what
 hdp 2.2 provides) and just start using it
 
  On Fri, Jun 19, 2015 at 10:28 PM, Ashish Soni asoni.le...@gmail.com
 wrote:
  I do not where to start  as Spark 1.2 comes bundled with HDP2.2 but i
 want to use 1.4 and i do not know how to update it to 1.4
 
  Ashish
 
  On Fri, Jun 19, 2015 at 8:26 AM, ayan guha guha.a...@gmail.com wrote:
  what problem are you facing? are you trying to build it yurself or
 gettingpre-built version?
 
  On Fri, Jun 19, 2015 at 10:22 PM, Ashish Soni asoni.le...@gmail.com
 wrote:
  Hi ,
 
  Is any one able to install Spark 1.4 on HDP 2.2 , Please let me know how
 can i do the same ?
 
  Ashish
 
 
 
  --
  Best Regards,
  Ayan Guha
 
 
 
 
  --
  Best Regards,
  Ayan Guha


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark DataFrame Reduce Job Took 40s for 6000 Rows

2015-06-15 Thread Todd Nist
Hi Proust,

Is it possible to see the query you are running, and can you run EXPLAIN
EXTENDED to show the physical plan for the query?  To generate the plan you
can do something like this from $SPARK_HOME/bin/beeline:

0: jdbc:hive2://localhost:10001 explain extended select * from
YourTableHere;
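
Alternatively, the plan can be printed from code via DataFrame.explain; a
minimal sketch:

val df = sqlContext.sql("SELECT * FROM YourTableHere")
df.explain(true)  // prints both the logical and physical plans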

-Todd

On Mon, Jun 15, 2015 at 10:57 AM, Proust GZ Feng pf...@cn.ibm.com wrote:

 Thanks a lot Akhil, after try some suggestions in the tuning guide, there
 seems no improvement at all.

 And below is the job detail when running locally(8cores) which took 3min
 to complete the job, we can see it is the map operation took most of time,
 looks like the mapPartitions took too long

 Is there any additional idea? Thanks a lot.

 Proust




 From:Akhil Das ak...@sigmoidanalytics.com
 To:Proust GZ Feng/China/IBM@IBMCN
 Cc:user@spark.apache.org user@spark.apache.org
 Date:06/15/2015 03:02 PM
 Subject:Re: Spark DataFrame Reduce Job Took 40s for 6000 Rows
 --



 Have a look here: https://spark.apache.org/docs/latest/tuning.html

 Thanks
 Best Regards

 On Mon, Jun 15, 2015 at 11:27 AM, Proust GZ Feng *pf...@cn.ibm.com*
 pf...@cn.ibm.com wrote:
 Hi, Spark Experts

 I have played with Spark several weeks, after some time testing, a reduce
 operation of DataFrame cost 40s on a cluster with 5 datanode executors.
 And the back-end rows is about 6,000, is this a normal case? Such
 performance looks too bad because in Java a loop for 6,000 rows cause just
 several seconds

 I'm wondering any document I should read to make the job much more fast?




 Thanks in advance
 Proust




Re: Spark 1.4 release date

2015-06-12 Thread Todd Nist
It was released yesterday.

On Friday, June 12, 2015, ayan guha guha.a...@gmail.com wrote:

 Hi

 When is official spark 1.4 release date?
 Best
 Ayan



Re: How to pass arguments dynamically, that needs to be used in executors

2015-06-11 Thread Todd Nist
Hi Gaurav,

Seems like you could use a broadcast variable for this if I understand your
use case.  Create it in the driver based on the CommandLineArguments and
then use it in the workers.

https://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables

So something like:

Broadcast<Integer> cmdLineArg = sc.broadcast(Integer.parseInt(args[12]));

Then just reference the broadcast variable in your workers.  It will get
shipped once to all nodes in the cluster and can be referenced by them.
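
A minimal sketch in Scala (in Java you would read it with cmdLineArg.value();
the RDD and accessor names are taken from the snippet further down):

val clickThreshold = sc.broadcast(args(12).toInt)

val filtered = adeAudGeoAggDataRdd.filter { stats =>
  stats.getClick > clickThreshold.value
}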

HTH.

-Todd

On Thu, Jun 11, 2015 at 8:23 AM, gaurav sharma sharmagaura...@gmail.com
wrote:

 Hi,

 I am using Kafka Spark cluster for real time aggregation analytics use
 case in production.

 Cluster details
 6 nodes, each node running 1 Spark and kafka processes each.
 Node1  - 1 Master , 1 Worker, 1 Driver,
1 Kafka process
 Node 2,3,4,5,6 - 1 Worker prcocess each
  1 Kafka process each

 Spark version 1.3.0
 Kafka Veriosn 0.8.1

 I am using Kafka Directstream for Kafka Spark Integration.
 Analytics code is written in using Spark Java API.

 Problem Statement :

   I want to accept a paramter as command line argument, and pass on to
 the executors.
   (want to use the paramter in rdd.foreach method which is executed on
 executor)

   I understand that when driver is started, only the jar is
 transported to all the Workers.
   But i need to use the dynamically passed command line argument in
 the reduce operation executed on executors.


 Code Snippets for better understanding my problem :

 public class KafkaReconcilationJob {

 private static Logger logger =
 Logger.getLogger(KafkaReconcilationJob.class);
  public static void main(String[] args) throws Exception {
     CommandLineArguments.CLICK_THRESHOLD = Integer.parseInt(args[12]); // <-- I want to use this command line argument
  }

 }



 JavaRDD<AggregatedAdeStats> adeAggregatedFilteredData =
     adeAudGeoAggDataRdd.filter(new Function<AggregatedAdeStats, Boolean>() {
   @Override
   public Boolean call(AggregatedAdeStats adeAggregatedObj) throws Exception {
     if (adeAggregatedObj.getImpr() > CommandLineArguments.IMPR_THRESHOLD ||
         adeAggregatedObj.getClick() > CommandLineArguments.CLICK_THRESHOLD) {
       return true;
     } else {
       return false;
     }
   }
 });



 The above-mentioned filter operation gets executed on an executor, which has 0
 as the value of the static field CommandLineArguments.CLICK_THRESHOLD.


 Regards,
 Gaurav



Re: Spark SQL and Streaming Results

2015-06-05 Thread Todd Nist
There used to be a project, StreamSQL (
https://github.com/thunderain-project/StreamSQL), but it appears a bit
dated and I do not see it in the Spark repo, though I may have missed it.

@TD Is this project still active?

I'm not sure what the status is, but it may provide some insights on how to
achieve what you're looking to do.

On Fri, Jun 5, 2015 at 6:34 PM, Tathagata Das t...@databricks.com wrote:

 You could take a look at the RDD *async operations and their source code. Maybe
 that can help in getting some early results.

 TD

 On Fri, Jun 5, 2015 at 8:41 AM, Pietro Gentile 
 pietro.gentile89.develo...@gmail.com wrote:

 Hi all,


 What is the best way to perform Spark SQL queries and obtain the result
 tuples in a streaming way? In particular, I want to aggregate data and
 obtain the first, incomplete results quickly, but they should keep being
 updated until the aggregation is complete.

 Best Regards.





Re: spark.executor.extraClassPath - Values not picked up by executors

2015-05-23 Thread Todd Nist
Hi Yana,

Yes, typo in the email; the file name is correct, spark-defaults.conf. Thanks
though.  So it appears to work if in the driver I specify it as part of
the SparkConf:

val conf = new SparkConf().setAppName(getClass.getSimpleName)
  .set("spark.executor.extraClassPath",
    "/projects/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar")

I thought the spark-defaults settings would be applied regardless of whether it was
a spark-submit (driver) or a custom driver as in my case, but apparently I
am mistaken.  This will work fine as I can ensure that all hosts
participating in the cluster have access to a common directory with the
dependencies and then just set the spark.executor.extraClassPath to
/some/shared/directory/lib/*.jar.

If there is a better way to address this, let me know.

As for the spark-cassandra-connector 1.3.0-SNAPSHOT, I am building that
from master.  Haven't hit any issue with it yet.

-Todd

On Fri, May 22, 2015 at 9:39 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:

 Todd, I don't have any answers for you...other than the file is actually
 named spark-defaults.conf (not sure if you made a typo in the email or
 misnamed the file...). Do any other options from that file get read?

 I also wanted to ask if you built the spark-cassandra-connector-assembly-
 1.3.0-SNAPSHOT.jar from trunk or if they published a 1.3 drop somewhere
 -- I'm just starting out with Cassandra and discovered
 https://datastax-oss.atlassian.net/browse/SPARKC-98 is still open...

 On Fri, May 22, 2015 at 6:15 PM, Todd Nist tsind...@gmail.com wrote:

 I'm using the spark-cassandra-connector from DataStax in a spark
 streaming job launched from my own driver.  It is connecting a a standalone
 cluster on my local box which has two worker running.

 This is Spark 1.3.1 and spark-cassandra-connector-1.3.0-SNAPSHOT.  I have
 added the following entry to my $SPARK_HOME/conf/spark-default.conf:

 spark.executor.extraClassPath 
 /projects/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar


 When I start the master with, $SPARK_HOME/sbin/start-master.sh, it comes
 up just fine.  As do the two workers with the following command:

 Worker 1, port 8081:

 radtech:spark $ ./bin/spark-class org.apache.spark.deploy.worker.Worker 
 spark://radtech.io:7077 --webui-port 8081 --cores 2

 Worker 2, port 8082

 radtech:spark $ ./bin/spark-class org.apache.spark.deploy.worker.Worker 
 spark://radtech.io:7077 --webui-port 8082 --cores 2

 When I execute the Driver connecting the the master:

 sbt app/run -Dspark.master=spark://radtech.io:7077

 It starts up, but when the executors are launched they do not include the
 entry in the spark.executor.extraClassPath:

 15/05/22 17:35:26 INFO Worker: Asked to launch executor 
 app-20150522173526-/0 for KillrWeatherApp$15/05/22 17:35:26 INFO 
 ExecutorRunner: Launch command: java -cp 
 /usr/local/spark/conf:/usr/local/spark/lib/spark-assembly-1.3.1-hadoop2.6.0.jar:/usr/local/spark/lib/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark/lib/datanucleus-core-3.2.10.jar:/usr/local/spark/lib/datanucleus-rdbms-3.2.9.jar:/usr/local/spark/conf:/usr/local/spark/lib/spark-assembly-1.3.1-hadoop2.6.0.jar:/usr/local/spark/lib/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark/lib/datanucleus-core-3.2.10.jar:/usr/local/spark/lib/datanucleus-rdbms-3.2.9.jar
  -Dspark.driver.port=55932 -Xms512M -Xmx512M 
 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
 akka.tcp://sparkDriver@192.168.1.3:55932/user/CoarseGrainedScheduler 
 --executor-id 0 --hostname 192.168.1.3 --cores 2 --app-id 
 app-20150522173526- --worker-url 
 akka.tcp://sparkWorker@192.168.1.3:55923/user/Worker



 which will then cause the executor to fail with a ClassNotFoundException,
 which I would expect:

 [WARN] [2015-05-22 17:38:18,035] 
 [org.apache.spark.scheduler.TaskSetManager]: Lost task 0.0 in stage 2.0 (TID 
 23, 192.168.1.3): java.lang.ClassNotFoundException: 
 com.datastax.spark.connector.rdd.partitioner.CassandraPartition
 at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:344)
 at 
 org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:65)
 at 
 java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
 at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
 at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
 at java.io.ObjectInputStream.readObject0

spark.executor.extraClassPath - Values not picked up by executors

2015-05-22 Thread Todd Nist
I'm using the spark-cassandra-connector from DataStax in a spark streaming
job launched from my own driver.  It is connecting to a standalone cluster
on my local box which has two workers running.

This is Spark 1.3.1 and spark-cassandra-connector-1.3.0-SNAPSHOT.  I have
added the following entry to my $SPARK_HOME/conf/spark-default.conf:

spark.executor.extraClassPath
/projects/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar


When I start the master with, $SPARK_HOME/sbin/start-master.sh, it comes up
just fine.  As do the two workers with the following command:

Worker 1, port 8081:

radtech:spark $ ./bin/spark-class
org.apache.spark.deploy.worker.Worker spark://radtech.io:7077
--webui-port 8081 --cores 2

Worker 2, port 8082

radtech:spark $ ./bin/spark-class
org.apache.spark.deploy.worker.Worker spark://radtech.io:7077
--webui-port 8082 --cores 2

When I execute the Driver, connecting to the master:

sbt app/run -Dspark.master=spark://radtech.io:7077

It starts up, but when the executors are launched they do not include the
entry in the spark.executor.extraClassPath:

15/05/22 17:35:26 INFO Worker: Asked to launch executor
app-20150522173526-/0 for KillrWeatherApp$15/05/22 17:35:26 INFO
ExecutorRunner: Launch command: java -cp
/usr/local/spark/conf:/usr/local/spark/lib/spark-assembly-1.3.1-hadoop2.6.0.jar:/usr/local/spark/lib/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark/lib/datanucleus-core-3.2.10.jar:/usr/local/spark/lib/datanucleus-rdbms-3.2.9.jar:/usr/local/spark/conf:/usr/local/spark/lib/spark-assembly-1.3.1-hadoop2.6.0.jar:/usr/local/spark/lib/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark/lib/datanucleus-core-3.2.10.jar:/usr/local/spark/lib/datanucleus-rdbms-3.2.9.jar
-Dspark.driver.port=55932 -Xms512M -Xmx512M
org.apache.spark.executor.CoarseGrainedExecutorBackend
--driver-url 
akka.tcp://sparkDriver@192.168.1.3:55932/user/CoarseGrainedScheduler
--executor-id 0 --hostname 192.168.1.3 --cores 2
--app-id app-20150522173526- --worker-url
akka.tcp://sparkWorker@192.168.1.3:55923/user/Worker



which will then cause the executor to fail with a ClassNotFoundException,
which I would expect:

[WARN] [2015-05-22 17:38:18,035]
[org.apache.spark.scheduler.TaskSetManager]: Lost task 0.0 in stage
2.0 (TID 23, 192.168.1.3): java.lang.ClassNotFoundException:
com.datastax.spark.connector.rdd.partitioner.CassandraPartition
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:344)
at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:65)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I also notice that some of the entries on the executor classpath are
duplicated?  This is a newly installed spark-1.3.1-bin-hadoop2.6
standalone cluster, just to ensure I had nothing from testing in the way.

I can set the SPARK_CLASSPATH in the $SPARK_HOME/spark-env.sh and it will
pick up the jar and append it fine.

Any suggestions on what is going on here?  Seems to just ignore whatever I
have in the spark.executor.extraClassPath.  Is there a different way to do
this?

TIA.

-Todd


Re: Question about Serialization in Storage Level

2015-05-21 Thread Todd Nist
From the docs,
https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence:

Storage Level - Meaning

MEMORY_ONLY - Store RDD as deserialized Java objects in the JVM. If the RDD
does not fit in memory, some partitions will not be cached and will be
recomputed on the fly each time they're needed. This is the default level.

MEMORY_AND_DISK - Store RDD as *deserialized* Java objects in the JVM. If the
RDD does not fit in memory, store the partitions that don't fit on disk, and
read them from there when they're needed.

MEMORY_ONLY_SER - Store RDD as *serialized* Java objects (one byte array per
partition). This is generally more space-efficient than deserialized objects,
especially when using a fast serializer
(https://spark.apache.org/docs/latest/tuning.html), but more CPU-intensive to read.

MEMORY_AND_DISK_SER - Similar to *MEMORY_ONLY_SER*, but spill partitions that
don't fit in memory to disk instead of recomputing them on the fly each time
they're needed.
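
So the only difference is whether the cached blocks are kept as deserialized
objects or as serialized bytes. A minimal sketch (paths are placeholders):

import org.apache.spark.storage.StorageLevel

val deser = sc.textFile("/data/a").persist(StorageLevel.MEMORY_AND_DISK)      // deserialized objects, spilled to disk if needed
val ser   = sc.textFile("/data/b").persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized byte arrays: less memory, more CPU to read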

On Thu, May 21, 2015 at 3:52 AM, Jiang, Zhipeng zhipeng.ji...@intel.com
wrote:

  Hi there,



 This question may seem to be kind of naïve, but what’s the difference
 between *MEMORY_AND_DISK* and *MEMORY_AND_DISK_SER*?



 If I call *rdd.persist(StorageLevel.MEMORY_AND_DISK)*, the BlockManager
 won’t serialize the *rdd*?



 Thanks,

 Zhipeng



Re: Spark sql error while writing Parquet file- Trying to write more fields than contained in row

2015-05-19 Thread Todd Nist
I believe you're looking for df.na.fill in Scala; in the PySpark module it is
fillna (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html)

from the docs:

df4.fillna({'age': 50, 'name': 'unknown'}).show()
age height name
10  80     Alice
5   null   Bob
50  null   Tom
50  null   unknown
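
In Scala the equivalent would be something like the following sketch (the
column names match the schema below; the fill value is just an example):

val cleaned = df.na.fill(Map("a" -> 0.0, "b" -> 0.0, "c" -> 0.0))
cleaned.saveAsParquetFile("/home/anand/output.parquet")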


On Mon, May 18, 2015 at 11:01 PM, Chandra Mohan, Ananda Vel Murugan 
ananda.muru...@honeywell.com wrote:

  Hi,



 Thanks for the response. But I could not see a fillna function in the
 DataFrame class.







 Is it available in some specific version of Spark SQL? This is what I have
 in my pom.xml:



 <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-sql_2.10</artifactId>
   <version>1.3.1</version>
 </dependency>



 Regards,

 Anand.C



 *From:* ayan guha [mailto:guha.a...@gmail.com]
 *Sent:* Monday, May 18, 2015 5:19 PM
 *To:* Chandra Mohan, Ananda Vel Murugan; user
 *Subject:* Re: Spark sql error while writing Parquet file- Trying to
 write more fields than contained in row



 Hi



 Give the dataFrame.fillna function a try to fill up the missing column.



 Best

 Ayan



 On Mon, May 18, 2015 at 8:29 PM, Chandra Mohan, Ananda Vel Murugan 
 ananda.muru...@honeywell.com wrote:

 Hi,



 I am using spark-sql to read a CSV file and write it as a parquet file. I am
 building the schema using the following code:



 String schemaString = "a b c";
 List<StructField> fields = new ArrayList<StructField>();
 MetadataBuilder mb = new MetadataBuilder();
 mb.putBoolean("nullable", true);
 Metadata m = mb.build();
 for (String fieldName : schemaString.split(" ")) {
     fields.add(new StructField(fieldName, DataTypes.DoubleType, true, m));
 }
 StructType schema = DataTypes.createStructType(fields);



 Some of the rows in my input csv do not contain three columns. After
 building my JavaRDD<Row>, I create the data frame as shown below using the
 RDD and schema.



 DataFrame darDataFrame = sqlContext.createDataFrame(rowRDD, schema);



 Finally I try to save it as Parquet file



 darDataFrame.saveAsParquetFile("/home/anand/output.parquet")



 I get this error when saving it as Parquet file



 java.lang.IndexOutOfBoundsException: Trying to write more fields than
 contained in row (3 > 2)



 I understand the reason behind this error. Some of the rows in my Row RDD do
 not contain three elements, as some rows in my input csv do not contain
 three columns. But while building the schema, I am specifying every field
 as nullable, so I believe it should not throw this error. Can anyone help
 me fix this error? Thank you.



 Regards,

 Anand.C









 --

 Best Regards,
 Ayan Guha



Re: group by and distinct performance issue

2015-05-19 Thread Todd Nist
You may want to look at this tooling for helping identify performance
issues and bottlenecks:

https://github.com/kayousterhout/trace-analysis

I believe this is slated to become part of the web UI in the 1.4 release; in
fact, based on the status of the JIRA,
https://issues.apache.org/jira/browse/SPARK-6418, it looks like it is complete.


On Tue, May 19, 2015 at 3:56 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Hi Peer,

 If you open the driver UI (running on port 4040) you can see the stages
 and the tasks happening inside them. The best way to identify the bottleneck for
 a stage is to see if there's any time being spent on GC, and how many tasks
 there are per stage (it should be a number > the total # of cores to achieve max
 parallelism). You can also take into account how long each task takes, etc.

 Thanks
 Best Regards

 On Tue, May 19, 2015 at 12:58 PM, Peer, Oded oded.p...@rsa.com wrote:

  I am running Spark over Cassandra to process a single table.

 My task reads a single day's worth of data from the table and performs 50
 group by and distinct operations, counting distinct userIds by different
 grouping keys.

 My code looks like this:



 JavaRDD<Row> rdd = sc.parallelize().mapPartitions().cache(); // reads the data from the table

 for each groupingKey {

    JavaPairRDD<GroupingKey, UserId> groupByRdd = rdd.mapToPair();

    JavaPairRDD<GroupingKey, Long> countRdd =
        groupByRdd.distinct().mapToPair().reduceByKey(); // counts distinct values per grouping key

 }



 The distinct() stage takes about 2 minutes for every groupByValue, and my
 task takes well over an hour to complete.

 My cluster has 4 nodes and 30 GB of RAM per Spark process, the table size
 is 4 GB.



 How can I identify the bottleneck more accurately? Is it caused by
 shuffling data?

 How can I improve the performance?



 Thanks,

 Oded





Re: value toDF is not a member of RDD object

2015-05-13 Thread Todd Nist
I believe what Dean Wampler was suggesting is to use the sqlContext, not the
sparkContext (sc), which is where the createDataFrame function resides:

https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.sql.SQLContext
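
For example, a minimal sketch of the same Person example from the guide, going
through the SQLContext (the earlier point about defining the case class outside
of main still applies):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
case class Person(name: String, age: Int)
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
val df = sqlContext.createDataFrame(people)
df.registerTempTable("people")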

HTH.

-Todd

On Wed, May 13, 2015 at 6:00 AM, SLiZn Liu sliznmail...@gmail.com wrote:

 Additionally, after I successfully packaged the code and submitted it via
 spark-submit webcat_2.11-1.0.jar, the following error was thrown at the line
 where toDF() was called:

 Exception in thread main java.lang.NoSuchMethodError: 
 scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror;
   at WebcatApp$.main(webcat.scala:49)
   at WebcatApp.main(webcat.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
   at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

 Unsurprisingly, if I remove toDF, no error occurred.

 I have moved the case class definition outside of main but inside the
 outer object scope, and removed the provided specification in build.sbt.
 However, when I tried *Dean Wampler*‘s suggestion of using
 sc.createDataFrame() the compiler says this function is not a member of sc,
 and I cannot find any reference in the latest documents. What else should I
 try?

 REGARDS,
 Todd Leo
 ​

 On Wed, May 13, 2015 at 11:27 AM SLiZn Liu sliznmail...@gmail.com wrote:

 Thanks folks, really appreciate all your replies! I tried each of your
 suggestions and in particular, *Animesh*‘s second suggestion of *making
 case class definition global* helped me getting off the trap.

 Plus, I should have paste my entire code with this mail to help the
 diagnose.

 REGARDS,
 Todd Leo
 ​

 On Wed, May 13, 2015 at 12:10 AM Dean Wampler deanwamp...@gmail.com
 wrote:

 It's the import statement Olivier showed that makes the method
 available.

 Note that you can also use `sc.createDataFrame(myRDD)`, without the need
 for the import statement. I personally prefer this approach.

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Tue, May 12, 2015 at 9:33 AM, Olivier Girardot ssab...@gmail.com
 wrote:

 you need to instantiate a SQLContext :
 val sc : SparkContext = ...
 val sqlContext = new SQLContext(sc)
 import sqlContext.implicits._

 Le mar. 12 mai 2015 à 12:29, SLiZn Liu sliznmail...@gmail.com a
 écrit :

 I added `libraryDependencies += "org.apache.spark" % "spark-sql_2.11"
 % "1.3.1"` to `build.sbt` but the error remains. Do I need to import
 modules other than `import org.apache.spark.sql.{ Row, SQLContext }`?

 On Tue, May 12, 2015 at 5:56 PM Olivier Girardot ssab...@gmail.com
 wrote:

 toDF is part of spark SQL so you need Spark SQL dependency + import
 sqlContext.implicits._ to get the toDF method.

 Regards,

 Olivier.

 Le mar. 12 mai 2015 à 11:36, SLiZn Liu sliznmail...@gmail.com a
 écrit :

 Hi User Group,

 I’m trying to reproduce the example on Spark SQL Programming Guide
 https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection,
 and got a compile error when packaging with sbt:

 [error] myfile.scala:30: value toDF is not a member of 
 org.apache.spark.rdd.RDD[Person]
 [error] val people = 
 sc.textFile(examples/src/main/resources/people.txt).map(_.split(,)).map(p
  = Person(p(0), p(1).trim.toInt)).toDF()
 [error] 
  ^
 [error] one error found
 [error] (compile:compileIncremental) Compilation failed
 [error] Total time: 3 s, completed May 12, 2015 4:11:53 PM

 I double checked my code includes import sqlContext.implicits._
 after reading this post
 https://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3c1426522113299-22083.p...@n3.nabble.com%3E
 on the spark mailing list, and even tried to use toDF("col1", "col2")
 as suggested by Xiangrui Meng in that post, and got the same error.

 The Spark version is specified in build.sbt file as follows:

 scalaVersion := "2.11.6"
 libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.3.1" % "provided"
 libraryDependencies += "org.apache.spark" % "spark-mllib_2.11" % "1.3.1"

 Anyone have ideas the cause of this error?

 REGARDS,
 Todd Leo
 ​





Re: Spark does not delete temporary directories

2015-05-07 Thread Todd Nist
Have you tried to set the following?

spark.worker.cleanup.enabled=true
spark.worker.cleanup.appDataTtl=<seconds>
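
For a standalone worker these are typically passed through SPARK_WORKER_OPTS;
a sketch of what that might look like in conf/spark-env.sh (the TTL value is
just an example):

export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=86400"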



On Thu, May 7, 2015 at 2:39 AM, Taeyun Kim taeyun@innowireless.com
wrote:

 Hi,



 After a Spark program completes, there are 3 temporary directories remaining
 in the temp directory.

 The file names are like this: spark-2e389487-40cc-4a82-a5c7-353c0feefbb7



 And when the Spark program runs on Windows, a snappy DLL file also remains in
 the temp directory.

 The file name is like this:
 snappy-1.0.4.1-6e117df4-97b6-4d69-bf9d-71c4a627940c-snappyjava



 They are created every time the Spark program runs. So the number of files
 and directories keeps growing.



 How can I let them be deleted?



 Spark version is 1.3.1 with Hadoop 2.6.



 Thanks.







Re: AvroFiles

2015-05-05 Thread Todd Nist
Are you using Kryo or Java serialization? I found this post useful:

http://stackoverflow.com/questions/23962796/kryo-readobject-cause-nullpointerexception-with-arraylist

If using Kryo, you need to register the classes with Kryo, something like
this:


sc.registerKryoClasses(Array(
  classOf[ConfigurationProperty],
  classOf[Event]
))

Or create a registrator something like this:

class ODSKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[ConfigurationProperty], new AvroSerializer[ConfigurationProperty]())
    kryo.register(classOf[Event], new AvroSerializer[Event]())
  }
}

I encountered a similar error since several of the Avro core classes are
not marked Serializable.
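
Either way, the registration gets wired into the SparkConf; a minimal sketch
(assuming the registrator above):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[ODSKryoRegistrator].getName)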

HTH.

Todd

On Tue, May 5, 2015 at 7:09 PM, Pankaj Deshpande ppa...@gmail.com wrote:

 Hi, I am using Spark 1.3.1 to read an avro file stored on HDFS. The avro
 file was created using Avro 1.7.7, similar to the example mentioned in
 http://www.infoobjects.com/spark-with-avro/.
 I am getting a NullPointerException on schema read. It could be an Avro
 version mismatch. Has anybody had a similar issue with Avro?


 Thanks



Parquet Partition Strategy - how to partition data correctly

2015-05-05 Thread Todd Nist
Hi,

I have a DataFrame that represents my data looks like this:

+-------------+-----------+
| col_name    | data_type |
+-------------+-----------+
| obj_id      | string    |
| type        | string    |
| name        | string    |
| metric_name | string    |
| value       | double    |
| ts          | timestamp |
+-------------+-----------+

It is working fine, and I can store it to parquet with:

df.saveAsParquetFile("/user/data/metrics")

I would like to leverage parquet partitioning as referenced here,
https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

I would like to see a representation something like this:

usr
|__ data
  |__ metrics
|__ type=Virtual Machine
  |__ objId=1234
|__ metricName=CPU Demand
  |__ mmdd
|__ data.parquet
|__ metricName=CPU Utilization
  |__ mmdd
|__ data.parquet
  |__ objId=5678
|__ metricName=CPU Demand
  |__ mmdd
|__ data.parquet
|__ type=Application
  |__ objId=0009
|__ metricName=Response Time
  |__ mmdd
|__ data.parquet
|__ metricName=Slow Response
  |__ mmdd
|__ data.parquet
  |__ objId=0303
|__ metricName=Response Time
  |__ mmdd
|__ data.parquet


What is the correct way to achieve this? I can do something like:

df.map { case Row(nodeType: String, objId: String, name: String,
  metricName: String, value: Double, ts: java.sql.Timestamp) =>

  ...
  // construct path
  val path =
    s"/usr/data/metrics/type=$nodeType/objId=$objId/metricName=$metricName/${floorToDay(ts)}"
  // save record as parquet
  df.saveAsParquet(path, Row)

  ...}
Is this the right approach or is there a more optimal approach?  This would save
every row as an individual file.  I will receive multiple entries for a
given metric, type and objId combination in a given day.
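
One way to realize the layout above with the 1.3 API might be the following
sketch (admittedly brute force: it collects the distinct key combinations to
the driver and writes one directory per combination; the day component and
floorToDay are left out for brevity):

import org.apache.spark.sql.Row

val keys = df.select("type", "obj_id", "metric_name").distinct().collect()
keys.foreach { case Row(t: String, id: String, m: String) =>
  df.filter(df("type") === t && df("obj_id") === id && df("metric_name") === m)
    .saveAsParquetFile(s"/user/data/metrics/type=$t/objId=$id/metricName=$m")
}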

TIA for the assistance.

-Todd


Spark Streaming Kafka Avro NPE on deserialization of payload

2015-05-01 Thread Todd Nist
*Resending as I do not see that this made it to the mailing list; sorry if
in fact it did and is just not reflected online yet.*

I’m very perplexed by the following. I have a set of Avro-generated
objects that are sent to a Spark Streaming job via Kafka. The Spark Streaming
job follows the receiver-based approach. I am encountering the below error
when I attempt to deserialize the payload:

15/04/30 17:49:25 INFO MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 9 to
sparkExecutor@192.168.1.3:6105115/04/30 17:49:25 INFO
MapOutputTrackerMaster: Size of output statuses for shuffle 9 is 140
bytes15/04/30 17:49:25 ERROR TaskResultGetter: Exception while getting
task resultcom.esotericsoftware.kryo.KryoException:
java.lang.NullPointerException
Serialization trace:
relations (com.opsdatastore.model.ObjectDetails)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at 
org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:173)
at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
at 
org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:621)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:379)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:50)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at org.apache.avro.generic.GenericData$Array.add(GenericData.java:200)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
... 17 more15/04/30 17:49:25 INFO TaskSchedulerImpl: Removed TaskSet
20.0, whose tasks have all completed, from pool

Basic code looks like this.

Register the class with Kryo as follows:

val sc = new SparkConf(true)
  .set("spark.streaming.unpersist", "true")
  .setAppName("StreamingKafkaConsumer")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// register all related AVRO generated classes
sc.registerKryoClasses(Array(
classOf[ConfigurationProperty],
classOf[Event],
classOf[Identifier],
classOf[Metric],
classOf[ObjectDetails],
classOf[Relation],
classOf[RelationProperty]
))

Use the receiver based approach to consume messages from Kafka:

 val messages = KafkaUtils.createStream[Array[Byte], Array[Byte],
DefaultDecoder, DefaultDecoder](ssc, kafkaParams, topics,
storageLevel)

Now process the received messages:

val raw = messages.map(_._2)
val dStream = raw.map(
  byte => {
    // Avro Decoder
    println("Byte length: " + byte.length)
    val decoder = new AvroDecoder[ObjectDetails](schema = ObjectDetails.getClassSchema)
    val message = decoder.fromBytes(byte)
    println(s"AvroMessage : Type : ${message.getType}, Payload : $message")
    message
  }
)

When I look in the logs of the workers, in standard out I can see the
messages being printed; in fact I’m even able to access the Type field
without issue:

Byte length: 315
AvroMessage : Type : Storage, Payload : {name: Storage 1, type:
Storage, vendor: 6274g51cbkmkqisk, model: lk95hqk9m10btaot,
timestamp: 1430428565141, identifiers: {ID: {name: ID,
value: Storage-1}}, configuration: null, metrics: {Disk Space
Usage (GB): {name: Disk Space Usage (GB), source: Generated,
values: {1430428565356: {timestamp: 1430428565356, value:
42.55948347907833}}}, Disk Space Capacity (GB): {name: Disk Space
Capacity (GB), source: Generated, values: {1430428565356:
{timestamp: 1430428565356, value: 38.980024705429095,
relations: [{type: parent, object_type: Virtual Machine,
properties: {ID: {name: ID, value: Virtual Machine-1}}}],

Spark Streaming Kafka Avro NPE on deserialization of payload

2015-04-30 Thread Todd Nist
I’m very perplexed by the following. I have a set of Avro-generated
objects that are sent to a Spark Streaming job via Kafka. The Spark Streaming
job follows the receiver-based approach. I am encountering the below error
when I attempt to deserialize the payload:

15/04/30 17:49:25 INFO MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 9 to
sparkExecutor@192.168.1.3:6105115/04/30 17:49:25 INFO
MapOutputTrackerMaster: Size of output statuses for shuffle 9 is 140
bytes15/04/30 17:49:25 ERROR TaskResultGetter: Exception while getting
task resultcom.esotericsoftware.kryo.KryoException:
java.lang.NullPointerException
Serialization trace:
relations (com.opsdatastore.model.ObjectDetails)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at 
org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:173)
at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
at 
org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:621)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:379)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:50)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at org.apache.avro.generic.GenericData$Array.add(GenericData.java:200)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
... 17 more15/04/30 17:49:25 INFO TaskSchedulerImpl: Removed TaskSet
20.0, whose tasks have all completed, from pool

Basic code looks like this.

Register the class with Kryo as follows:

val sc = new SparkConf(true)
  .set("spark.streaming.unpersist", "true")
  .setAppName("StreamingKafkaConsumer")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// register all related AVRO generated classes
sc.registerKryoClasses(Array(
classOf[ConfigurationProperty],
classOf[Event],
classOf[Identifier],
classOf[Metric],
classOf[ObjectDetails],
classOf[Relation],
classOf[RelationProperty]
))

Use the receiver based approach to consume messages from Kafka:

 val messages = KafkaUtils.createStream[Array[Byte], Array[Byte],
DefaultDecoder, DefaultDecoder](ssc, kafkaParams, topics,
storageLevel)

Now process the received messages:

val raw = messages.map(_._2)
val dStream = raw.map(
  byte => {
    // Avro Decoder
    println("Byte length: " + byte.length)
    val decoder = new AvroDecoder[ObjectDetails](schema = ObjectDetails.getClassSchema)
    val message = decoder.fromBytes(byte)
    println(s"AvroMessage : Type : ${message.getType}, Payload : $message")
    message
  }
)

When I look in the logs of the workers, in standard out I can see the
messages being printed; in fact I’m even able to access the Type field
without issue:

Byte length: 315
AvroMessage : Type : Storage, Payload : {name: Storage 1, type:
Storage, vendor: 6274g51cbkmkqisk, model: lk95hqk9m10btaot,
timestamp: 1430428565141, identifiers: {ID: {name: ID,
value: Storage-1}}, configuration: null, metrics: {Disk Space
Usage (GB): {name: Disk Space Usage (GB), source: Generated,
values: {1430428565356: {timestamp: 1430428565356, value:
42.55948347907833}}}, Disk Space Capacity (GB): {name: Disk Space
Capacity (GB), source: Generated, values: {1430428565356:
{timestamp: 1430428565356, value: 38.980024705429095,
relations: [{type: parent, object_type: Virtual Machine,
properties: {ID: {name: ID, value: Virtual Machine-1}}}],
events: [], components: []}

The ObjectDetails class, which is generated from Avro, has a relations field
which is of type java.util.List:


Re: Calculating the averages for each KEY in a Pairwise (K,V) RDD ...

2015-04-28 Thread Todd Nist
Can you simply apply the
https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.util.StatCounter
to this?  You should be able to do something like this:

val stats = RDD.map(x => x._2).stats()
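
If you want the statistics per key rather than over all values, the same
StatCounter can be folded per key with aggregateByKey; a sketch in Scala of
the same idea (rdd1 being the (key, value) pair RDD):

import org.apache.spark.util.StatCounter

val statsByKey = rdd1.aggregateByKey(new StatCounter())(
  (acc, v) => acc.merge(v),   // fold a value into the key's counter
  (a, b) => a.merge(b))       // combine counters from different partitions
val avgByKey = statsByKey.mapValues(_.mean)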

-Todd

On Tue, Apr 28, 2015 at 10:00 AM, subscripti...@prismalytics.io 
subscripti...@prismalytics.io wrote:

  Hello Friends:

 I generated a Pair RDD with K/V pairs, like so:

 
  rdd1.take(10) # Show a small sample.
  [(u'2013-10-09', 7.60117302052786),
  (u'2013-10-10', 9.322709163346612),
  (u'2013-10-10', 28.264462809917358),
  (u'2013-10-07', 9.664429530201343),
  (u'2013-10-07', 12.461538461538463),
  (u'2013-10-09', 20.76923076923077),
  (u'2013-10-08', 11.842105263157894),
  (u'2013-10-13', 32.32514177693762),
  (u'2013-10-13', 26.246),
  (u'2013-10-13', 10.693069306930692)]

 Now from the above RDD, I would like to calculate an average of the VALUES
 for each KEY.
 I can do so as shown here, which does work:

  countsByKey = sc.broadcast(rdd1.countByKey()) # SAMPLE OUTPUT of
 countsByKey.value: {u'2013-09-09': 215, u'2013-09-08': 69, ... snip ...}
  rdd1 = rdd1.reduceByKey(operator.add) # Calculate the numerator (i.e.
 the SUM).
  rdd1 = rdd1.map(lambda x: (x[0], x[1]/countsByKey.value[x[0]])) #
 Divide each SUM by it's denominator (i.e. COUNT)
  print(rdd1.collect())
   [(u'2013-10-09', 11.235365503035176),
(u'2013-10-07', 23.39500642456595),
... snip ...
   ]

 But I wonder if the above semantics/approach is the optimal one, or
 whether perhaps there is a single API call that handles this common use
 case.

 Improvement thoughts welcome. =:)

 Thank you,
 nmv



Re: Cannot saveAsParquetFile from a RDD of case class

2015-04-14 Thread Todd Nist
I think the docs are correct.  If you follow the example from the docs and add
the import shown below, I believe you will get what you're looking for:

// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

You could also simply take your rdd and do the following:

logs.toDF.saveAsParquetFile("s3n://xxx/xxx")


-Todd

On Tue, Apr 14, 2015 at 3:50 AM, pishen tsai pishe...@gmail.com wrote:

 OK, it does work.
 Maybe it would be better to update this usage in the official Spark SQL
 tutorial:
 http://spark.apache.org/docs/latest/sql-programming-guide.html

 Thanks,
 pishen


 2015-04-14 15:30 GMT+08:00 fightf...@163.com fightf...@163.com:

 Hi,there

 If you want to use the saveAsParquetFile, you may want to use
 val log_df =  sqlContext.createDataFrame(logs)

 And then you can issue log_df.saveAsParquetFile (path)

 Best,
 Sun.

 --
 fightf...@163.com


 *From:* pishen pishe...@gmail.com
 *Date:* 2015-04-14 15:18
 *To:* user user@spark.apache.org
 *Subject:* Cannot saveAsParquetFile from a RDD of case class
 Hello,

 I tried to follow the Spark SQL tutorial, but I am not able to
 saveAsParquetFile from an RDD of a case class.
 Here are my Main.scala and build.sbt:
 https://gist.github.com/pishen/939cad3da612ec03249f

 At line 34, compiler said that value saveAsParquetFile is not a member
 of org.apache.spark.rdd.RDD[core.Log]

 Any suggestion on how to solve this?

 Thanks,
 pishen

 --
 View this message in context: Cannot saveAsParquetFile from a RDD of
 case class
 http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-saveAsParquetFile-from-a-RDD-of-case-class-tp22488.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.





Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-09 Thread Todd Nist
Hi Mohammed,

Sorry, I guess I was not really clear in my response.  Yes, sbt fails; the
-DskipTests is for mvn, as I showed in the example of how I built it.

I do not believe that -DskipTests has any impact in sbt, but I could be
wrong.  sbt package should skip tests.  I did not try to track down where
the dependency was coming from.  Based on Patrick's comments it sounds like
this is now resolved.

Sorry for the confusion.

-Todd

On Wed, Apr 8, 2015 at 4:38 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Mohammed,

 I think you just need to add -DskipTests to you build.  Here is how I
 built it:

 mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver
 -DskipTests clean package install

 build/sbt does however fail even if only doing package which should skip
 tests.

 I am able to build the MyThriftServer above now.

 Thanks Michael for the assistance.

 -Todd

 On Wed, Apr 8, 2015 at 3:39 PM, Mohammed Guller moham...@glassbeam.com
 wrote:

  Michael,

 Thank you!



 Looks like the sbt build is broken for 1.3. I downloaded the source code
 for 1.3, but I get the following error a few minutes after I run “sbt/sbt
 publishLocal”



 [error] (network-shuffle/*:update) sbt.ResolveException: unresolved
 dependency: org.apache.spark#spark-network-common_2.10;1.3.0: configuration
 not public in org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It
 was required from org.apache.spark#spark-network-shuffle_2.10;1.3.0 test

 [error] Total time: 106 s, completed Apr 8, 2015 12:33:45 PM



 Mohammed



 *From:* Michael Armbrust [mailto:mich...@databricks.com]
 *Sent:* Wednesday, April 8, 2015 11:54 AM
 *To:* Mohammed Guller
 *Cc:* Todd Nist; James Aley; user; Patrick Wendell

 *Subject:* Re: Advice using Spark SQL and Thrift JDBC Server



 Sorry guys.  I didn't realize that
 https://issues.apache.org/jira/browse/SPARK-4925 was not fixed yet.



 You can publish locally in the mean time (sbt/sbt publishLocal).



 On Wed, Apr 8, 2015 at 8:29 AM, Mohammed Guller moham...@glassbeam.com
 wrote:

 +1



 Interestingly, I ran into the exactly the same issue yesterday.  I
 couldn’t find any documentation about which project to include as a
 dependency in build.sbt to use HiveThriftServer2. Would appreciate help.



 Mohammed



 *From:* Todd Nist [mailto:tsind...@gmail.com]
 *Sent:* Wednesday, April 8, 2015 5:49 AM
 *To:* James Aley
 *Cc:* Michael Armbrust; user
 *Subject:* Re: Advice using Spark SQL and Thrift JDBC Server



 To use the HiveThriftServer2.startWithContext, I thought one would use
 the  following artifact in the build:



  "org.apache.spark" %% "spark-hive-thriftserver" % "1.3.0"



 But I am unable to resolve the artifact.  I do not see it in maven
 central or any other repo.  Do I need to build Spark and publish locally or
 just missing something obvious here?



 Basic class is like this:



 import org.apache.spark.{SparkConf, SparkContext}



 import  org.apache.spark.sql.hive.HiveContext

 import org.apache.spark.sql.hive.HiveMetastoreTypes._

 import org.apache.spark.sql.types._

 import  org.apache.spark.sql.hive.thriftserver._



 object MyThriftServer {



   val sparkConf = new SparkConf()

 // master is passed to spark-submit, but could also be specified 
 explicitely

 // .setMaster(sparkMaster)

 .setAppName(My ThriftServer)

 .set(spark.cores.max, 2)

   val sc = new SparkContext(sparkConf)

   val  sparkContext  =  sc

   import  sparkContext._

   val  sqlContext  =  new  HiveContext(sparkContext)

   import  sqlContext._

   import sqlContext.implicits._



  // register temp tables here
  HiveThriftServer2.startWithContext(sqlContext)

 }

  Build has the following:



 scalaVersion := 2.10.4



 val SPARK_VERSION = 1.3.0





 libraryDependencies ++= Seq(

 org.apache.spark %% spark-streaming-kafka % SPARK_VERSION

   exclude(org.apache.spark, spark-core_2.10)

   exclude(org.apache.spark, spark-streaming_2.10)

   exclude(org.apache.spark, spark-sql_2.10)

   exclude(javax.jms, jms),

 org.apache.spark %% spark-core  % SPARK_VERSION %  provided,

 org.apache.spark %% spark-streaming % SPARK_VERSION %  provided,

 org.apache.spark  %% spark-sql  % SPARK_VERSION % provided,

 org.apache.spark  %% spark-hive % SPARK_VERSION % provided,

 org.apache.spark %% spark-hive-thriftserver  % SPARK_VERSION   %
 provided,

 org.apache.kafka %% kafka % 0.8.1.1

   exclude(javax.jms, jms)

   exclude(com.sun.jdmk, jmxtools)

   exclude(com.sun.jmx, jmxri),

 joda-time % joda-time % 2.7,

 log4j % log4j % 1.2.14

   exclude(com.sun.jdmk, jmxtools)

   exclude(com.sun.jmx, jmxri)

   )



 Appreciate the assistance.



 -Todd



 On Tue, Apr 7, 2015 at 4:09 PM, James Aley james.a...@swiftkey.com
 wrote:

 Excellent, thanks for your help, I appreciate your advice!

 On 7 Apr 2015 20:43, Michael Armbrust mich...@databricks.com wrote:

 That should totally work.  The other option would be to run

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-08 Thread Todd Nist
To use the HiveThriftServer2.startWithContext, I thought one would use the
 following artifact in the build:

"org.apache.spark" %% "spark-hive-thriftserver" % "1.3.0"

But I am unable to resolve the artifact.  I do not see it in Maven Central
or any other repo.  Do I need to build Spark and publish it locally, or am I
just missing something obvious here?

Basic class is like this:

import org.apache.spark.{SparkConf, SparkContext}

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveMetastoreTypes._
import org.apache.spark.sql.types._
import org.apache.spark.sql.hive.thriftserver._

object MyThriftServer {

  val sparkConf = new SparkConf()
    // master is passed to spark-submit, but could also be specified explicitly
    // .setMaster(sparkMaster)
    .setAppName("My ThriftServer")
    .set("spark.cores.max", "2")
  val sc = new SparkContext(sparkConf)
  val sparkContext = sc
  import sparkContext._
  val sqlContext = new HiveContext(sparkContext)
  import sqlContext._
  import sqlContext.implicits._

  // register temp tables here
  HiveThriftServer2.startWithContext(sqlContext)
}

Build has the following:

scalaVersion := "2.10.4"

val SPARK_VERSION = "1.3.0"


libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming-kafka" % SPARK_VERSION
    exclude("org.apache.spark", "spark-core_2.10")
    exclude("org.apache.spark", "spark-streaming_2.10")
    exclude("org.apache.spark", "spark-sql_2.10")
    exclude("javax.jms", "jms"),
  "org.apache.spark" %% "spark-core" % SPARK_VERSION % "provided",
  "org.apache.spark" %% "spark-streaming" % SPARK_VERSION % "provided",
  "org.apache.spark" %% "spark-sql" % SPARK_VERSION % "provided",
  "org.apache.spark" %% "spark-hive" % SPARK_VERSION % "provided",
  "org.apache.spark" %% "spark-hive-thriftserver" % SPARK_VERSION % "provided",
  "org.apache.kafka" %% "kafka" % "0.8.1.1"
    exclude("javax.jms", "jms")
    exclude("com.sun.jdmk", "jmxtools")
    exclude("com.sun.jmx", "jmxri"),
  "joda-time" % "joda-time" % "2.7",
  "log4j" % "log4j" % "1.2.14"
    exclude("com.sun.jdmk", "jmxtools")
    exclude("com.sun.jmx", "jmxri")
)

Appreciate the assistance.

-Todd

On Tue, Apr 7, 2015 at 4:09 PM, James Aley james.a...@swiftkey.com wrote:

 Excellent, thanks for your help, I appreciate your advice!
 On 7 Apr 2015 20:43, Michael Armbrust mich...@databricks.com wrote:

 That should totally work.  The other option would be to run a persistent
 metastore that multiple contexts can talk to and periodically run a job
 that creates missing tables.  The trade-off here would be more complexity,
 but less downtime due to the server restarting.

 On Tue, Apr 7, 2015 at 12:34 PM, James Aley james.a...@swiftkey.com
 wrote:

 Hi Michael,

 Thanks so much for the reply - that really cleared a lot of things up
 for me!

 Let me just check that I've interpreted one of your suggestions for (4)
 correctly... Would it make sense for me to write a small wrapper app that
 pulls in hive-thriftserver as a dependency, iterates my Parquet
 directory structure to discover tables and registers each as a temp table
 in some context, before calling HiveThriftServer2.createWithContext as
 you suggest?

 This would mean that to add new content, all I need to is restart that
 app, which presumably could also be avoided fairly trivially by
 periodically restarting the server with a new context internally. That
 certainly beats manual curation of Hive table definitions, if it will work?
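
 Something along those lines might look like the following rough sketch (the
 directory layout, table naming and use of parquetFile are assumptions):

 import org.apache.hadoop.fs.{FileSystem, Path}
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

 val sqlContext = new HiveContext(sc)
 val fs = FileSystem.get(sc.hadoopConfiguration)
 // one temp table per top-level parquet directory, named after the directory
 fs.listStatus(new Path("/data/parquet")).filter(_.isDirectory).foreach { dir =>
   sqlContext.parquetFile(dir.getPath.toString).registerTempTable(dir.getPath.getName)
 }
 HiveThriftServer2.startWithContext(sqlContext)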


 Thanks again,

 James.

 On 7 April 2015 at 19:30, Michael Armbrust mich...@databricks.com
 wrote:

 1) What exactly is the relationship between the thrift server and Hive?
 I'm guessing Spark is just making use of the Hive metastore to access 
 table
 definitions, and maybe some other things, is that the case?


 Underneath the covers, the Spark SQL thrift server is executing queries
 using a HiveContext.  In this mode, nearly all computation is done with
 Spark SQL but we try to maintain compatibility with Hive wherever
 possible.  This means that you can write your queries in HiveQL, read
 tables from the Hive metastore, and use Hive UDFs UDTs UDAFs, etc.

 The one exception here is Hive DDL operations (CREATE TABLE, etc).
 These are passed directly to Hive code and executed there.  The Spark SQL
 DDL is sufficiently different that we always try to parse that first, and
 fall back to Hive when it does not parse.

 One possibly confusing point here, is that you can persist Spark SQL
 tables into the Hive metastore, but this is not the same as a Hive table.
 We are only use the metastore as a repo for metadata, but are not using
 their format for the information in this case (as we have datasources that
 hive does not understand, including things like schema auto discovery).

 HiveQL DDL, run by Hive but can be read by Spark SQL: CREATE TABLE t (x
 INT) STORED AS PARQUET
 Spark SQL DDL, run by Spark SQL, stored in metastore, cannot be read by
 hive: CREATE TABLE t USING parquet (path '/path/to/data')


 2) Am I therefore 

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-08 Thread Todd Nist
Hi Mohammed,

I think you just need to add -DskipTests to your build.  Here is how I built
it:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver
-DskipTests clean package install

build/sbt does, however, fail even if only doing package, which should skip
tests.

I am able to build the MyThriftServer above now.

Thanks Michael for the assistance.

-Todd

On Wed, Apr 8, 2015 at 3:39 PM, Mohammed Guller moham...@glassbeam.com
wrote:

  Michael,

 Thank you!



 Looks like the sbt build is broken for 1.3. I downloaded the source code
 for 1.3, but I get the following error a few minutes after I run “sbt/sbt
 publishLocal”



 [error] (network-shuffle/*:update) sbt.ResolveException: unresolved
 dependency: org.apache.spark#spark-network-common_2.10;1.3.0: configuration
 not public in org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It
 was required from org.apache.spark#spark-network-shuffle_2.10;1.3.0 test

 [error] Total time: 106 s, completed Apr 8, 2015 12:33:45 PM



 Mohammed



 *From:* Michael Armbrust [mailto:mich...@databricks.com]
 *Sent:* Wednesday, April 8, 2015 11:54 AM
 *To:* Mohammed Guller
 *Cc:* Todd Nist; James Aley; user; Patrick Wendell

 *Subject:* Re: Advice using Spark SQL and Thrift JDBC Server



 Sorry guys.  I didn't realize that
 https://issues.apache.org/jira/browse/SPARK-4925 was not fixed yet.



 You can publish locally in the mean time (sbt/sbt publishLocal).



 On Wed, Apr 8, 2015 at 8:29 AM, Mohammed Guller moham...@glassbeam.com
 wrote:

 +1



 Interestingly, I ran into the exactly the same issue yesterday.  I
 couldn’t find any documentation about which project to include as a
 dependency in build.sbt to use HiveThriftServer2. Would appreciate help.



 Mohammed



 *From:* Todd Nist [mailto:tsind...@gmail.com]
 *Sent:* Wednesday, April 8, 2015 5:49 AM
 *To:* James Aley
 *Cc:* Michael Armbrust; user
 *Subject:* Re: Advice using Spark SQL and Thrift JDBC Server



 To use the HiveThriftServer2.startWithContext, I thought one would use the
  following artifact in the build:



 "org.apache.spark" %% "spark-hive-thriftserver" % "1.3.0"



 But I am unable to resolve the artifact.  I do not see it in maven central
 or any other repo.  Do I need to build Spark and publish locally or just
 missing something obvious here?



 Basic class is like this:



 import org.apache.spark.{SparkConf, SparkContext}



 import  org.apache.spark.sql.hive.HiveContext

 import org.apache.spark.sql.hive.HiveMetastoreTypes._

 import org.apache.spark.sql.types._

 import  org.apache.spark.sql.hive.thriftserver._



 object MyThriftServer {



   val sparkConf = new SparkConf()

 // master is passed to spark-submit, but could also be specified 
 explicitely

 // .setMaster(sparkMaster)

 .setAppName(My ThriftServer)

 .set(spark.cores.max, 2)

   val sc = new SparkContext(sparkConf)

   val  sparkContext  =  sc

   import  sparkContext._

   val  sqlContext  =  new  HiveContext(sparkContext)

   import  sqlContext._

   import sqlContext.implicits._



  // register temp tables here
  HiveThriftServer2.startWithContext(sqlContext)

 }

  Build has the following:



 scalaVersion := 2.10.4



 val SPARK_VERSION = 1.3.0





 libraryDependencies ++= Seq(

 org.apache.spark %% spark-streaming-kafka % SPARK_VERSION

   exclude(org.apache.spark, spark-core_2.10)

   exclude(org.apache.spark, spark-streaming_2.10)

   exclude(org.apache.spark, spark-sql_2.10)

   exclude(javax.jms, jms),

 org.apache.spark %% spark-core  % SPARK_VERSION %  provided,

 org.apache.spark %% spark-streaming % SPARK_VERSION %  provided,

 org.apache.spark  %% spark-sql  % SPARK_VERSION % provided,

 org.apache.spark  %% spark-hive % SPARK_VERSION % provided,

 org.apache.spark %% spark-hive-thriftserver  % SPARK_VERSION   %
 provided,

 org.apache.kafka %% kafka % 0.8.1.1

   exclude(javax.jms, jms)

   exclude(com.sun.jdmk, jmxtools)

   exclude(com.sun.jmx, jmxri),

 joda-time % joda-time % 2.7,

 log4j % log4j % 1.2.14

   exclude(com.sun.jdmk, jmxtools)

   exclude(com.sun.jmx, jmxri)

   )



 Appreciate the assistance.



 -Todd



 On Tue, Apr 7, 2015 at 4:09 PM, James Aley james.a...@swiftkey.com
 wrote:

 Excellent, thanks for your help, I appreciate your advice!

 On 7 Apr 2015 20:43, Michael Armbrust mich...@databricks.com wrote:

 That should totally work.  The other option would be to run a persistent
 metastore that multiple contexts can talk to and periodically run a job
 that creates missing tables.  The trade-off here would be more complexity,
 but less downtime due to the server restarting.



 On Tue, Apr 7, 2015 at 12:34 PM, James Aley james.a...@swiftkey.com
 wrote:

 Hi Michael,



 Thanks so much for the reply - that really cleared a lot of things up for
 me!



 Let me just check that I've interpreted one of your suggestions for (4)
 correctly... Would it make sense for me to write

Spark SQL Parquet as External table - 1.3.x HiveMetastoreType now hidden

2015-04-06 Thread Todd Nist
In 1.2.1 I was persisting a set of parquet files as a table for use by the
spark-sql cli later on. There was a post here,
http://apache-spark-user-list.1001560.n3.nabble.com/persist-table-schema-in-spark-sql-tt16297.html#a16311,
by Michael Armbrust that provides a nice little helper method for dealing with
this:

/**
 * Sugar for creating a Hive external table from a parquet path.
 */
def createParquetTable(name: String, file: String): Unit = {
  import org.apache.spark.sql.hive.HiveMetastoreTypes

  val rdd = parquetFile(file)
  val schema = rdd.schema.fields.map(f =>
    s"${f.name} ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n")
  val ddl = s"""
    |CREATE EXTERNAL TABLE $name (
    |  $schema
    |)
    |ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
    |STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
    |OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
    |LOCATION '$file'""".stripMargin
  sql(ddl)
  setConf("spark.sql.hive.convertMetastoreParquet", "true")
}

In migrating to 1.3.x I see that the spark.sql.hive.convertMetastoreParquet
is no longer public, so the above no longer works.

I can define a helper method that wraps the HiveMetastoreTypes something
like:

package org.apache.spark.sql.hive
import org.apache.spark.sql.types.DataType

/**
 * Helper to expose HiveMetastoreTypes hidden by Spark.  It is created
in this name space to make it accessible.
 */
object HiveTypeHelper {
  def toDataType(metastoreType: String): DataType =
HiveMetastoreTypes.toDataType(metastoreType)
  def toMetastoreType(dataType: DataType): String =
HiveMetastoreTypes.toMetastoreType(dataType)
}
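
With that wrapper in place, the helper above only needs to swap in the new
name; a sketch of the changed line:

val schema = rdd.schema.fields.map(f =>
  s"${f.name} ${HiveTypeHelper.toMetastoreType(f.dataType)}").mkString(",\n")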

While this will work, is there a better way to achieve this under 1.3.x?

TIA for the assistance.

-Todd


Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Todd Nist
I placed it there.  It was downloaded from MySql site.

On Fri, Apr 3, 2015 at 6:25 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 Akhil
 you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar .
 how come you got this lib into spark/lib folder.
 1) did you place it there ?
 2) What is download location ?


 On Fri, Apr 3, 2015 at 3:42 PM, Todd Nist tsind...@gmail.com wrote:

 Started the spark shell with the one jar from hive suggested:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 
 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar 
 --jars /opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar

 Results in the same error:

 scala> sql("""
      | SELECT path, name, value, v1.peValue, v1.peName
      |   FROM metric_table
      |     lateral view json_tuple(pathElements, 'name', 'value') v1
      |       as peName, peValue
      | """)
 15/04/03 06:01:30 INFO ParseDriver: Parsing command: SELECT path, name,
 value, v1.peValue, v1.peName FROM metric_table   lateral
 view json_tuple(pathElements, 'name', 'value') v1 as peName, peValue
 15/04/03 06:01:31 INFO ParseDriver: Parse Completed
 res2: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[5] at RDD at SchemaRDD.scala:108
 == Query Plan ==
 == Physical Plan ==
 java.lang.ClassNotFoundException: json_tuple

 I will try the rebuild.  Thanks again for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 5:34 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Can you try building Spark
 https://spark.apache.org/docs/1.2.0/building-spark.html#building-with-hive-and-jdbc-support%23building-with-hive-and-jdbc-support
 with hive support? Before that try to run the following:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-
 cores 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.
 1.34-bin.jar --jars /opt/hive/0.13.1/lib/hive-exec.jar

 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 This is for version 1.2.1.  Well the other thread that you reference
 was me attempting it in 1.3.0 to see if the issue was related to 1.2.1.  I
 did not build Spark but used the version from the Spark download site for
 1.2.1 Pre Built for Hadoop 2.4 or Later.

 Since I get the error in both 1.2.1 and 1.3.0,

 15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in
 thread main java.lang.ClassNotFoundException: json_tuple at
 java.net.URLClassLoader$1.run(

 It looks like I just don't have the jar.  Even including all jars in
 the $HIVE/lib directory did not seem to work.  Though when looking in
 $HIVE/lib for 0.13.1, I do not see any json serde or jackson files.  I do
 see that hive-exec.jar contains
 the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
 you know if there is another Jar that is required or should it work just by
 including all jars from $HIVE/lib?

 I can build it locally, but did not think that was required based on
 the version I downloaded; is that not the case?

 Thanks for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 How did you build spark? which version of spark are you having?
 Doesn't this thread already explains it?
 https://www.mail-archive.com/user@spark.apache.org/msg25505.html

 Thanks
 Best Regards

 On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 Tried your suggestion to no avail.  I actually do not see any
 jackson or json serde jars in the $HIVE/lib directory.  This is hive
 0.13.1 and spark 1.2.1

 Here is what I did:

 I have added the lib folder to the –jars option when starting the
 spark-shell,
 but the job fails. The hive-site.xml is in the $SPARK_HOME/conf
 directory.

 I start the spark-shell as follows:

 ./bin/spark-shell --master spark://radtech.io:7077 
 --total-executor-cores 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

 and like this

 ./bin/spark-shell --master spark://radtech.io:7077 
 --total-executor-cores 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
 /opt/hive/0.13.1/lib/*

 I’m just doing this in the spark-shell now:

 import org.apache.spark.sql.hive._
 val sqlContext = new HiveContext(sc)
 import sqlContext._

 case class MetricTable(path: String, pathElements: String, name: String, value: String)

 val mt = new MetricTable(
   path = "/DC1/HOST1/",
   pathElements = """[{"node": "DataCenter","value": "DC1"},{"node": "host","value": "HOST1"}]""",
   name = "Memory Usage (%)",
   value = "29.590943279257175")

 val rdd1 = sc.makeRDD(List(mt))
 rdd1.printSchema()
 rdd1.registerTempTable("metric_table")

 sql("""
   SELECT path, name, value, v1.peValue, v1.peName
     FROM metric_table
     lateral view json_tuple(pathElements, 'name', 'value') v1
       as peName, peValue
   """)
   .collect.foreach(println(_))

 It results in the same

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Todd Nist
Hi Deepujain,

I did include the jar file, I believe it is hive-exe.jar, through the
--jars option:

./bin/spark-shell --master spark://radtech.io:7077
--total-executor-cores 2 --driver-class-path
/usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars
/opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar

Results in the same error.  I'm going to do the rebuild in a few minutes.

Thanks for the assistance.

-Todd



On Fri, Apr 3, 2015 at 6:30 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 I think you need to include the jar file through --jars option that
 contains the hive definition (code) of UDF json_tuple. That should solve
 your problem.

 On Fri, Apr 3, 2015 at 3:57 PM, Todd Nist tsind...@gmail.com wrote:

 I placed it there.  It was downloaded from MySql site.

 On Fri, Apr 3, 2015 at 6:25 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 Akhil
 you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar
 . how come you got this lib into spark/lib folder.
 1) did you place it there ?
 2) What is download location ?


 On Fri, Apr 3, 2015 at 3:42 PM, Todd Nist tsind...@gmail.com wrote:

 Started the spark shell with the one jar from hive suggested:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 
 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
 /opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar

 Results in the same error:

 scala sql( | SELECT path, name, value, v1.peValue, v1.peName   
   |  FROM metric_table |lateral view 
 json_tuple(pathElements, 'name', 'value') v1 |  as peName, 
 peValue | )
 15/04/03 06:01:30 INFO ParseDriver: Parsing command: SELECT path, name, 
 value, v1.peValue, v1.peName FROM metric_table   lateral 
 view json_tuple(pathElements, 'name', 'value') v1 as peName, 
 peValue
 15/04/03 06:01:31 INFO ParseDriver: Parse Completed
 res2: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[5] at RDD at SchemaRDD.scala:108== Query Plan  Physical Plan 
 ==
 java.lang.ClassNotFoundException: json_tuple

 I will try the rebuild.  Thanks again for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 5:34 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Can you try building Spark
 https://spark.apache.org/docs/1.2.0/building-spark.html#building-with-hive-and-jdbc-support%23building-with-hive-and-jdbc-support
 with hive support? Before that try to run the following:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-
 cores 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-
 5.1.34-bin.jar --jars /opt/hive/0.13.1/lib/hive-exec.jar

 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 This is for version 1.2.1.  Well the other thread that you reference
 was me attempting it in 1.3.0 to see if the issue was related to 1.2.1.  
 I
 did not build Spark but used the version from the Spark download site for
 1.2.1 Pre Built for Hadoop 2.4 or Later.

 Since I get the error in both 1.2.1 and 1.3.0,

 15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in
 thread main java.lang.ClassNotFoundException: json_tuple at
 java.net.URLClassLoader$1.run(

 It looks like I just don't have the jar.  Even including all jars in
 the $HIVE/lib directory did not seem to work.  Though when looking in
 $HIVE/lib for 0.13.1, I do not see any json serde or jackson files.  I do
 see that hive-exec.jar contains
 the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
 you know if there is another Jar that is required or should it work just 
 by
 including all jars from $HIVE/lib?

 I can build it locally, but did not think that was required based on
 the version I downloaded; is that not the case?

 Thanks for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das ak...@sigmoidanalytics.com
  wrote:

 How did you build spark? which version of spark are you having?
 Doesn't this thread already explains it?
 https://www.mail-archive.com/user@spark.apache.org/msg25505.html

 Thanks
 Best Regards

 On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist tsind...@gmail.com
 wrote:

 Hi Akhil,

 Tried your suggestion to no avail.  I actually do not see any
 jackson or json serde jars in the $HIVE/lib directory.  This is
 hive
 0.13.1 and spark 1.2.1

 Here is what I did:

 I have added the lib folder to the –jars option when starting the
 spark-shell,
 but the job fails. The hive-site.xml is in the $SPARK_HOME/conf
 directory.

 I start the spark-shell as follows:

 ./bin/spark-shell --master spark://radtech.io:7077 
 --total-executor-cores 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

 and like this

 ./bin/spark-shell --master spark://radtech.io:7077 
 --total-executor-cores 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
 /opt/hive/0.13.1/lib/*

 I’m just doing

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Todd Nist
Started the spark shell with the one jar from hive suggested:

./bin/spark-shell --master spark://radtech.io:7077
--total-executor-cores 2 --driver-class-path
/usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars
/opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar

Results in the same error:

scala> sql("""
     | SELECT path, name, value, v1.peValue, v1.peName
     |   FROM metric_table
     |     lateral view json_tuple(pathElements, 'name', 'value') v1
     |       as peName, peValue
     | """)
15/04/03 06:01:30 INFO ParseDriver: Parsing command: SELECT path,
name, value, v1.peValue, v1.peName FROM metric_table
lateral view json_tuple(pathElements, 'name', 'value') v1
as peName, peValue
15/04/03 06:01:31 INFO ParseDriver: Parse Completed
res2: org.apache.spark.sql.SchemaRDD =
SchemaRDD[5] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
java.lang.ClassNotFoundException: json_tuple

I will try the rebuild.  Thanks again for the assistance.

-Todd


On Fri, Apr 3, 2015 at 5:34 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Can you try building Spark
 https://spark.apache.org/docs/1.2.0/building-spark.html#building-with-hive-and-jdbc-support%23building-with-hive-and-jdbc-support
 with hive support? Before that try to run the following:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores
 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin
 .jar --jars /opt/hive/0.13.1/lib/hive-exec.jar

 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 This is for version 1.2.1.  Well the other thread that you reference was
 me attempting it in 1.3.0 to see if the issue was related to 1.2.1.  I did
 not build Spark but used the version from the Spark download site for 1.2.1
 Pre Built for Hadoop 2.4 or Later.

 Since I get the error in both 1.2.1 and 1.3.0,

 15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in thread
 main java.lang.ClassNotFoundException: json_tuple at
 java.net.URLClassLoader$1.run(

 It looks like I just don't have the jar.  Even including all jars in the
 $HIVE/lib directory did not seem to work.  Though when looking in $HIVE/lib
 for 0.13.1, I do not see any json serde or jackson files.  I do see that
 hive-exec.jar contains
 the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
 you know if there is another Jar that is required or should it work just by
 including all jars from $HIVE/lib?

 I can build it locally, but did not think that was required based on the
 version I downloaded; is that not the case?

 Thanks for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 How did you build spark? which version of spark are you having? Doesn't
 this thread already explains it?
 https://www.mail-archive.com/user@spark.apache.org/msg25505.html

 Thanks
 Best Regards

 On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 Tried your suggestion to no avail.  I actually do not see any jackson
 or json serde jars in the $HIVE/lib directory.  This is hive 0.13.1 and
 spark 1.2.1

 Here is what I did:

 I have added the lib folder to the –jars option when starting the
 spark-shell,
 but the job fails. The hive-site.xml is in the $SPARK_HOME/conf
 directory.

 I start the spark-shell as follows:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 
 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

 and like this

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 
 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
 /opt/hive/0.13.1/lib/*

 I’m just doing this in the spark-shell now:

 import org.apache.spark.sql.hive._
 val sqlContext = new HiveContext(sc)
 import sqlContext._

 case class MetricTable(path: String, pathElements: String, name: String, value: String)

 val mt = new MetricTable(
   path = "/DC1/HOST1/",
   pathElements = """[{"node": "DataCenter","value": "DC1"},{"node": "host","value": "HOST1"}]""",
   name = "Memory Usage (%)",
   value = "29.590943279257175")

 val rdd1 = sc.makeRDD(List(mt))
 rdd1.printSchema()
 rdd1.registerTempTable("metric_table")

 sql("""
   SELECT path, name, value, v1.peValue, v1.peName
     FROM metric_table
     lateral view json_tuple(pathElements, 'name', 'value') v1
       as peName, peValue
   """)
   .collect.foreach(println(_))

 It results in the same error:

 15/04/02 12:33:59 INFO ParseDriver: Parsing command: SELECT path, name, 
 value, v1.peValue, v1.peName FROM metric_table   lateral 
 view json_tuple(pathElements, 'name', 'value') v1 as peName, 
 peValue
 15/04/02 12:34:00 INFO ParseDriver: Parse Completed
 res2: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[5] at RDD at SchemaRDD.scala:108== Query Plan  Physical Plan 
 ==
 java.lang.ClassNotFoundException: json_tuple

Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
What version of Cassandra are you using?  Are you using DSE or the stock
Apache Cassandra version?  I have connected it with DSE, but have not
attempted it with the standard Apache Cassandra version.

FWIW,
http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
provides an ODBC driver for accessing C* from Tableau.  Granted it does not
provide all the goodness of Spark.  Are you attempting to leverage the
spark-cassandra-connector for this?



On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller moham...@glassbeam.com
wrote:

  Hi –



 Is anybody using Tableau to analyze data in Cassandra through the Spark
 SQL Thrift Server?



 Thanks!



 Mohammed





Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
Hi Mohammed,

Not sure if you have tried this or not.  You could try using the below api
to start the thriftserver with an existing context.

https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42

The one thing that Michael Armbrust @ Databricks recommended was this:

 You can start a JDBC server with an existing context.  See my answer here:
 http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html

So something like this based on example from Cheng Lian:

*Server*

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.catalyst.types._

val sparkContext = sc
import sparkContext._
val sqlContext = new HiveContext(sparkContext)
import sqlContext._
makeRDD((1, "hello") :: (2, "world") :: Nil).toSchemaRDD.cache().registerTempTable("t")
// replace the above with the C* + spark-cassandra-connector to
// generate a SchemaRDD and registerTempTable

import org.apache.spark.sql.hive.thriftserver._
HiveThriftServer2.startWithContext(sqlContext)

Then Startup

./bin/beeline -u jdbc:hive2://localhost:1/default
0: jdbc:hive2://localhost:1/default> select * from t;


I have not tried this yet from Tableau.   My understanding is that the
tempTable is only valid as long as the sqlContext is, so if one terminates
the code representing the *Server*, and then restarts the standard thrift
server, sbin/start-thriftserver ..., the table won't be available.
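
(Aside, not part of the original reply: to make the "replace the above with the C* + spark-cassandra-connector" comment concrete, a hedged sketch of what the Cassandra side might look like. The keyspace, table, and case class are made up, and it assumes the same sc, sqlContext, and import sqlContext._ as the snippet above.)

// Hedged sketch: source the temp table from Cassandra via the spark-cassandra-connector,
// then expose it through the same startWithContext call. Keyspace, table, and Metric are hypothetical.
import com.datastax.spark.connector._

case class Metric(path: String, name: String, value: Double)

val metrics = sc.cassandraTable[Metric]("my_keyspace", "metrics")  // made-up keyspace/table
metrics.toSchemaRDD.cache().registerTempTable("metrics")           // relies on import sqlContext._

import org.apache.spark.sql.hive.thriftserver._
HiveThriftServer2.startWithContext(sqlContext)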

Another possibility is to perhaps use the tuplejump cash project,
https://github.com/tuplejump/cash.

HTH.

-Todd

On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar pkv...@gmail.com wrote:

 Thanks mohammed. Will give it a try today. We would also need the
 sparksSQL piece as we are migrating our data store from oracle to C* and it
 would be easier to maintain all the reports rather recreating each one from
 scratch.

 Thanks,
 Pawan Venugopal.
 On Apr 3, 2015 7:59 AM, Mohammed Guller moham...@glassbeam.com wrote:

  Hi Todd,



 We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly
 with C* using the ODBC driver, but now would like to add Spark SQL to the
 mix. I haven’t been able to find any documentation for how to make this
 combination work.



 We are using the Spark-Cassandra-Connector in our applications, but
 haven’t been able to figure out how to get the Spark SQL Thrift Server to
 use it and connect to C*. That is the missing piece. Once we solve that
 piece of the puzzle then Tableau should be able to see the tables in C*.



 Hi Pawan,

 Tableau + C* is pretty straight forward, especially if you are using DSE.
 Create a new DSN in Tableau using the ODBC driver that comes with DSE. Once
 you connect, Tableau allows to use C* keyspace as schema and column
 families as tables.



 Mohammed



 *From:* pawan kumar [mailto:pkv...@gmail.com]
 *Sent:* Friday, April 3, 2015 7:41 AM
 *To:* Todd Nist
 *Cc:* user@spark.apache.org; Mohammed Guller
 *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra



 Hi Todd,

 Thanks for the link. I would be interested in this solution. I am using
 DSE for cassandra. Would you provide me with info on connecting with DSE
 either through Tableau or zeppelin. The goal here is query cassandra
 through spark sql so that I could perform joins and groupby on my queries.
 Are you able to perform spark sql queries with tableau?

 Thanks,
 Pawan Venugopal

 On Apr 3, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote:

 What version of Cassandra are you using?  Are you using DSE or the stock
 Apache Cassandra version?  I have connected it with DSE, but have not
 attempted it with the standard Apache Cassandra version.



 FWIW,
 http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
 provides an ODBC driver for accessing C* from Tableau.  Granted it does not
 provide all the goodness of Spark.  Are you attempting to leverage the
 spark-cassandra-connector for this?







 On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller moham...@glassbeam.com
 wrote:

 Hi –



 Is anybody using Tableau to analyze data in Cassandra through the Spark
 SQL Thrift Server?



 Thanks!



 Mohammed








Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
@Pawan

Not sure if you have seen this or not, but here is a good example by
Jonathan Lacefield of DataStax on hooking up SparkSQL with DSE; adding
Tableau is as simple as Mohammed stated with DSE.
https://github.com/jlacefie/sparksqltest.

HTH,
Todd

On Fri, Apr 3, 2015 at 2:39 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Mohammed,

 Not sure if you have tried this or not.  You could try using the below api
 to start the thriftserver with an existing context.


 https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42

 The one thing that Michael Armbrust @ Databricks recommended was this:

 You can start a JDBC server with an existing context.  See my answer
 here:
 http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html

 So something like this based on example from Cheng Lian:

 *Server*

 import  org.apache.spark.sql.hive.HiveContext
 import  org.apache.spark.sql.catalyst.types._

 val  sparkContext  =  sc
 import  sparkContext._
 val  sqlContext  =  new  HiveContext(sparkContext)
 import  sqlContext._
 makeRDD((1,hello) :: (2,world) 
 ::Nil).toSchemaRDD.cache().registerTempTable(t)
 // replace the above with the C* + spark-casandra-connectore to generate 
 SchemaRDD and registerTempTable

 import  org.apache.spark.sql.hive.thriftserver._
 HiveThriftServer2.startWithContext(sqlContext)

 Then Startup

 ./bin/beeline -u jdbc:hive2://localhost:1/default
 0: jdbc:hive2://localhost:1/default select * from t;


 I have not tried this yet from Tableau.   My understanding is that the
 tempTable is only valid as long as the sqlContext is, so if one terminates
 the code representing the *Server*, and then restarts the standard thrift
 server, sbin/start-thriftserver ..., the table won't be available.

 Another possibility is to perhaps use the tuplejump cash project,
 https://github.com/tuplejump/cash.

 HTH.

 -Todd

 On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar pkv...@gmail.com wrote:

 Thanks mohammed. Will give it a try today. We would also need the
 sparksSQL piece as we are migrating our data store from oracle to C* and it
 would be easier to maintain all the reports rather recreating each one from
 scratch.

 Thanks,
 Pawan Venugopal.
 On Apr 3, 2015 7:59 AM, Mohammed Guller moham...@glassbeam.com wrote:

  Hi Todd,



 We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly
 with C* using the ODBC driver, but now would like to add Spark SQL to the
 mix. I haven’t been able to find any documentation for how to make this
 combination work.



 We are using the Spark-Cassandra-Connector in our applications, but
 haven’t been able to figure out how to get the Spark SQL Thrift Server to
 use it and connect to C*. That is the missing piece. Once we solve that
 piece of the puzzle then Tableau should be able to see the tables in C*.



 Hi Pawan,

 Tableau + C* is pretty straight forward, especially if you are using
 DSE. Create a new DSN in Tableau using the ODBC driver that comes with DSE.
 Once you connect, Tableau allows to use C* keyspace as schema and column
 families as tables.



 Mohammed



 *From:* pawan kumar [mailto:pkv...@gmail.com]
 *Sent:* Friday, April 3, 2015 7:41 AM
 *To:* Todd Nist
 *Cc:* user@spark.apache.org; Mohammed Guller
 *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra



 Hi Todd,

 Thanks for the link. I would be interested in this solution. I am using
 DSE for cassandra. Would you provide me with info on connecting with DSE
 either through Tableau or zeppelin. The goal here is query cassandra
 through spark sql so that I could perform joins and groupby on my queries.
 Are you able to perform spark sql queries with tableau?

 Thanks,
 Pawan Venugopal

 On Apr 3, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote:

 What version of Cassandra are you using?  Are you using DSE or the stock
 Apache Cassandra version?  I have connected it with DSE, but have not
 attempted it with the standard Apache Cassandra version.



 FWIW,
 http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
 provides an ODBC driver for accessing C* from Tableau.  Granted it does not
 provide all the goodness of Spark.  Are you attempting to leverage the
 spark-cassandra-connector for this?







 On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller moham...@glassbeam.com
 wrote:

 Hi –



 Is anybody using Tableau to analyze data in Cassandra through the Spark
 SQL Thrift Server?



 Thanks!



 Mohammed









Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
@Pawan,

So it's been a couple of months since I have had a chance to do anything
with Zeppelin, but here is a link to a post on what I did to get it working
https://groups.google.com/forum/#!topic/zeppelin-developers/mCNdyOXNikI.
This may or may not work with the newer releases from Zeppelin.

-Todd

On Fri, Apr 3, 2015 at 3:02 PM, pawan kumar pkv...@gmail.com wrote:

 Hi Todd,

 Thanks for the help. So i was able to get the DSE working with tableau as
 per the link provided by Mohammed. Now i trying to figure out if i could
 write sparksql queries from tableau and get data from DSE. My end goal is
 to get a web based tool where i could write sql queries which will pull
 data from cassandra.

 With Zeppelin I was able to build and run it in EC2, but I am not sure if the
 configurations are right. I am pointing to a Spark master which is a remote
 DSE node, and all Spark and SparkSQL dependencies are on the remote node. I
 am not sure if I need to install Spark and its dependencies on the web UI
 (Zeppelin) node.

 I am not sure talking about Zeppelin in this thread is right.

 Thanks once again for all the help.

 Thanks,
 Pawan Venugopal


 On Fri, Apr 3, 2015 at 11:48 AM, Todd Nist tsind...@gmail.com wrote:

 @Pawan

 Not sure if you have seen this or not, but here is a good example by
 Jonathan Lacefield of Datastax's on hooking up sparksql with DSE, adding
 Tableau is as simple as Mohammed stated with DSE.
 https://github.com/jlacefie/sparksqltest.

 HTH,
 Todd

 On Fri, Apr 3, 2015 at 2:39 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Mohammed,

 Not sure if you have tried this or not.  You could try using the below
 api to start the thriftserver with an existing context.


 https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42

 The one thing that Michael Armbrust @ Databricks recommended was this:

 You can start a JDBC server with an existing context.  See my answer
 here:
 http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html

 So something like this based on example from Cheng Lian:

 *Server*

 import  org.apache.spark.sql.hive.HiveContext
 import  org.apache.spark.sql.catalyst.types._

 val  sparkContext  =  sc
 import  sparkContext._
 val  sqlContext  =  new  HiveContext(sparkContext)
 import  sqlContext._
 makeRDD((1,hello) :: (2,world) 
 ::Nil).toSchemaRDD.cache().registerTempTable(t)
 // replace the above with the C* + spark-casandra-connectore to generate 
 SchemaRDD and registerTempTable

 import  org.apache.spark.sql.hive.thriftserver._
 HiveThriftServer2.startWithContext(sqlContext)

 Then Startup

 ./bin/beeline -u jdbc:hive2://localhost:1/default
 0: jdbc:hive2://localhost:1/default select * from t;


 I have not tried this yet from Tableau.   My understanding is that the
 tempTable is only valid as long as the sqlContext is, so if one terminates
 the code representing the *Server*, and then restarts the standard
 thrift server, sbin/start-thriftserver ..., the table won't be available.

 Another possibility is to perhaps use the tuplejump cash project,
 https://github.com/tuplejump/cash.

 HTH.

 -Todd

 On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar pkv...@gmail.com wrote:

 Thanks mohammed. Will give it a try today. We would also need the
 sparksSQL piece as we are migrating our data store from oracle to C* and it
 would be easier to maintain all the reports rather recreating each one from
 scratch.

 Thanks,
 Pawan Venugopal.
 On Apr 3, 2015 7:59 AM, Mohammed Guller moham...@glassbeam.com
 wrote:

  Hi Todd,



 We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly
 with C* using the ODBC driver, but now would like to add Spark SQL to the
 mix. I haven’t been able to find any documentation for how to make this
 combination work.



 We are using the Spark-Cassandra-Connector in our applications, but
 haven’t been able to figure out how to get the Spark SQL Thrift Server to
 use it and connect to C*. That is the missing piece. Once we solve that
 piece of the puzzle then Tableau should be able to see the tables in C*.



 Hi Pawan,

 Tableau + C* is pretty straight forward, especially if you are using
 DSE. Create a new DSN in Tableau using the ODBC driver that comes with 
 DSE.
 Once you connect, Tableau allows to use C* keyspace as schema and column
 families as tables.



 Mohammed



 *From:* pawan kumar [mailto:pkv...@gmail.com]
 *Sent:* Friday, April 3, 2015 7:41 AM
 *To:* Todd Nist
 *Cc:* user@spark.apache.org; Mohammed Guller
 *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra



 Hi Todd,

 Thanks for the link. I would be interested in this solution. I am
 using DSE for cassandra. Would you provide me with info on connecting with
 DSE either through Tableau or zeppelin. The goal here is query cassandra
 through spark sql so that I could perform joins and groupby on my queries.
 Are you able to perform spark sql

Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
Thanks Mohammed,

I was aware of Calliope, but haven't used it since the
spark-cassandra-connector project got released.  I was not aware of
CalliopeServer2; cool, thanks for sharing that one.

I would appreciate it if you could lmk how you decide to proceed with this;
I can see this coming up on my radar in the next few months; thanks.

-Todd

On Fri, Apr 3, 2015 at 5:53 PM, Mohammed Guller moham...@glassbeam.com
wrote:

  Thanks, Todd.



 It is an interesting idea; worth trying.



 I think the cash project is old. The tuplejump guy has created another
 project called CalliopeServer2, which works like a charm with BI tools that
 use JDBC, but unfortunately Tableau throws an error when it connects to it.



 Mohammed



 *From:* Todd Nist [mailto:tsind...@gmail.com]
 *Sent:* Friday, April 3, 2015 11:39 AM
 *To:* pawan kumar
 *Cc:* Mohammed Guller; user@spark.apache.org

 *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra



 Hi Mohammed,



 Not sure if you have tried this or not.  You could try using the below api
 to start the thriftserver with an existing context.


 https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42

 The one thing that Michael Armbrust @ Databricks recommended was this:

 You can start a JDBC server with an existing context.  See my answer here:
 http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html

 So something like this based on example from Cheng Lian:


 * Server*

 import  org.apache.spark.sql.hive.HiveContext

 import  org.apache.spark.sql.catalyst.types._



 val  sparkContext  =  sc

 import  sparkContext._

 val  sqlContext  =  new  HiveContext(sparkContext)

 import  sqlContext._

 makeRDD((1,hello) :: (2,world) 
 ::Nil).toSchemaRDD.cache().registerTempTable(t)

 // replace the above with the C* + spark-casandra-connectore to generate 
 SchemaRDD and registerTempTable



 import  org.apache.spark.sql.hive.thriftserver._

 HiveThriftServer2.startWithContext(sqlContext)

   Then Startup

 ./bin/beeline -u jdbc:hive2://localhost:1/default

 0: jdbc:hive2://localhost:1/default select * from t;



   I have not tried this yet from Tableau.   My understanding is that the
 tempTable is only valid as long as the sqlContext is, so if one terminates
 the code representing the *Server*, and then restarts the standard thrift
 server, sbin/start-thriftserver ..., the table won't be available.



 Another possibility is to perhaps use the tuplejump cash project,
 https://github.com/tuplejump/cash.



 HTH.



 -Todd



 On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar pkv...@gmail.com wrote:

 Thanks mohammed. Will give it a try today. We would also need the
 sparksSQL piece as we are migrating our data store from oracle to C* and it
 would be easier to maintain all the reports rather recreating each one from
 scratch.

 Thanks,
 Pawan Venugopal.

 On Apr 3, 2015 7:59 AM, Mohammed Guller moham...@glassbeam.com wrote:

 Hi Todd,



 We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly
 with C* using the ODBC driver, but now would like to add Spark SQL to the
 mix. I haven’t been able to find any documentation for how to make this
 combination work.



 We are using the Spark-Cassandra-Connector in our applications, but
 haven’t been able to figure out how to get the Spark SQL Thrift Server to
 use it and connect to C*. That is the missing piece. Once we solve that
 piece of the puzzle then Tableau should be able to see the tables in C*.



 Hi Pawan,

 Tableau + C* is pretty straight forward, especially if you are using DSE.
 Create a new DSN in Tableau using the ODBC driver that comes with DSE. Once
 you connect, Tableau allows to use C* keyspace as schema and column
 families as tables.



 Mohammed



 *From:* pawan kumar [mailto:pkv...@gmail.com]
 *Sent:* Friday, April 3, 2015 7:41 AM
 *To:* Todd Nist
 *Cc:* user@spark.apache.org; Mohammed Guller
 *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra



 Hi Todd,

 Thanks for the link. I would be interested in this solution. I am using
 DSE for cassandra. Would you provide me with info on connecting with DSE
 either through Tableau or zeppelin. The goal here is query cassandra
 through spark sql so that I could perform joins and groupby on my queries.
 Are you able to perform spark sql queries with tableau?

 Thanks,
 Pawan Venugopal

 On Apr 3, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote:

 What version of Cassandra are you using?  Are you using DSE or the stock
 Apache Cassandra version?  I have connected it with DSE, but have not
 attempted it with the standard Apache Cassandra version.



 FWIW,
 http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
 provides an ODBC driver for accessing C* from Tableau.  Granted it does not
 provide all the goodness of Spark.  Are you attempting

Re: Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available

2015-04-02 Thread Todd Nist
Hi Young,

Sorry for the duplicate post, want to reply to all.

I just downloaded the prebuilt bits from the Apache Spark download site,
started the spark shell, and got the same error.

I then started the shell as follows:

./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2
--driver-class-path
/usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars $(echo
~/Downloads/apache-hive-0.13.1-bin/lib/*.jar | tr ' ' ',')

this worked, or at least got rid of this:

scala> case class MetricTable(path: String, pathElements: String, name:
String, value: String) scala.reflect.internal.Types$TypeError: bad symbolic
reference. A signature in HiveMetastoreCatalog.class refers to term cache in
package com.google.common which is not available. It may be completely
missing from the current classpath, or the version on the classpath might
be incompatible with the version used when compiling HiveMetastoreCatalog
.class. That entry seems to have slain the compiler. Shall I replay your
session? I can re-run each line except the last one. [y/n]

Still getting the ClassNotFoundException, json_tuple, from this statement
same as in 1.2.1:

sql("""
  SELECT path, name, value, v1.peValue, v1.peName
    FROM metric_table
    lateral view json_tuple(pathElements, 'name', 'value') v1
      as peName, peValue
  """)
  .collect.foreach(println(_))


15/04/02 20:50:14 INFO ParseDriver: Parsing command: SELECT path,
name, value, v1.peValue, v1.peName
 FROM metric_table
   lateral view json_tuple(pathElements, 'name', 'value') v1
 as peName, peValue15/04/02 20:50:14 INFO ParseDriver:
Parse Completed
java.lang.ClassNotFoundException: json_tuple
at 
scala.tools.nsc.interpreter.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:83)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

Any ideas on the json_tuple exception?
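
(Aside, not part of the original message: one hedged workaround, if the json_tuple UDTF cannot be resolved, is to fall back to the get_json_object UDF with JSONPath expressions. Whether the Hive UDF in this version accepts a root-level array subscript like $[0] would need to be verified; the sketch below is an assumption, not a confirmed fix.)

// Hedged sketch: pull fields out of the pathElements JSON array with
// get_json_object instead of the json_tuple UDTF. The [0] index is illustrative.
sql("""
  SELECT path, name, value,
         get_json_object(pathElements, '$[0].node')  AS peName,
         get_json_object(pathElements, '$[0].value') AS peValue
    FROM metric_table
  """)
  .collect.foreach(println(_))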


Modified the syntax to take into account some minor changes in 1.3.  The
one posted this morning was from my 1.2.1 test.

import sqlContext.implicits._

case class MetricTable(path: String, pathElements: String, name: String, value: String)

val mt = new MetricTable(
  path = "/DC1/HOST1/",
  pathElements = """[{"node": "DataCenter","value": "DC1"},{"node": "host","value": "HOST1"}]""",
  name = "Memory Usage (%)",
  value = "29.590943279257175")

val rdd1 = sc.makeRDD(List(mt))
val df = rdd1.toDF
df.printSchema
df.show
df.registerTempTable("metric_table")

sql("""
  SELECT path, name, value, v1.peValue, v1.peName
    FROM metric_table
    lateral view json_tuple(pathElements, 'name', 'value') v1
      as peName, peValue
  """)
  .collect.foreach(println(_))


On Thu, Apr 2, 2015 at 8:21 PM, java8964 java8...@hotmail.com wrote:

 Hmm, I just tested my own Spark 1.3.0 build. I have the same problem, but
 I cannot reproduce it on Spark 1.2.1

 If we check the code change below:

 Spark 1.3 branch

 https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala

 vs

 Spark 1.2 branch

 https://github.com/apache/spark/blob/branch-1.2/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala

 You can see that on line 24:

 import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

 is introduced on 1.3 branch.

 The error basically mean runtime com.google.common.cache package cannot be
 found in the classpath.

 Either you and me made the same mistake when we build Spark 1.3.0, or
 there are something wrong with Spark 1.3 pom.xml file.

 Here is how I built the 1.3.0:

 1) Download the spark 1.3.0 source
 2) make-distribution --targz -Dhadoop.version=1.1.1 -Phive -Phive-0.12.0
 -Phive-thriftserver -DskipTests

 Is this only due to that I built against Hadoop 1.x?

 Yong


 --
 Date: Thu, 2 Apr 2015 13:56:33 -0400
 Subject: Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class
 refers to term cache in package com.google.common which is not available
 From: tsind...@gmail.com
 To: user@spark.apache.org


 I was trying a simple test from the spark-shell to see if 1.3.0 would
 address a problem I was having with locating the json_tuple class and got
 the following error:

 scala import org.apache.spark.sql.hive._
 import org.apache.spark.sql.hive._

 scala val sqlContext = new HiveContext(sc)sqlContext: 
 org.apache.spark.sql.hive.HiveContext = 
 org.apache.spark.sql.hive.HiveContext@79c849c7

 scala import sqlContext._
 import sqlContext._

 scala case class MetricTable(path: String, pathElements: String, name: 
 String, value: String)scala.reflect.internal.Types$TypeError: bad symbolic 
 reference. A signature in HiveMetastoreCatalog.class refers to term cachein 
 package com.google.common which is not available.
 It may be completely missing from the current classpath, or the version on
 the classpath might be incompatible with the version used when compiling 
 HiveMetastoreCatalog.class.
 That entry seems to have slain 

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-02 Thread Todd Nist
Hi Akhil,

Tried your suggestion to no avail.  I actually do not see any jackson or
json serde jars in the $HIVE/lib directory.  This is hive 0.13.1 and
spark 1.2.1

Here is what I did:

I have added the lib folder to the –jars option when starting the
spark-shell,
but the job fails. The hive-site.xml is in the $SPARK_HOME/conf directory.

I start the spark-shell as follows:

./bin/spark-shell --master spark://radtech.io:7077
--total-executor-cores 2 --driver-class-path
/usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

and like this

./bin/spark-shell --master spark://radtech.io:7077
--total-executor-cores 2 --driver-class-path
/usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars
/opt/hive/0.13.1/lib/*

I’m just doing this in the spark-shell now:

import org.apache.spark.sql.hive._
val sqlContext = new HiveContext(sc)
import sqlContext._

case class MetricTable(path: String, pathElements: String, name: String, value: String)

val mt = new MetricTable(
  path = "/DC1/HOST1/",
  pathElements = """[{"node": "DataCenter","value": "DC1"},{"node": "host","value": "HOST1"}]""",
  name = "Memory Usage (%)",
  value = "29.590943279257175")

val rdd1 = sc.makeRDD(List(mt))
rdd1.printSchema()
rdd1.registerTempTable("metric_table")

sql("""
  SELECT path, name, value, v1.peValue, v1.peName
    FROM metric_table
    lateral view json_tuple(pathElements, 'name', 'value') v1
      as peName, peValue
  """)
  .collect.foreach(println(_))

It results in the same error:

15/04/02 12:33:59 INFO ParseDriver: Parsing command: SELECT path,
name, value, v1.peValue, v1.peName FROM metric_table
lateral view json_tuple(pathElements, 'name', 'value') v1
as peName, peValue
15/04/02 12:34:00 INFO ParseDriver: Parse Completed
res2: org.apache.spark.sql.SchemaRDD =
SchemaRDD[5] at RDD at SchemaRDD.scala:108== Query Plan  Physical Plan ==
java.lang.ClassNotFoundException: json_tuple

Any other suggestions or am I doing something else wrong here?

-Todd



On Thu, Apr 2, 2015 at 2:00 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Try adding all the jars in your $HIVE/lib directory. If you want the
 specific jar, you could look fr jackson or json serde in it.

 Thanks
 Best Regards

 On Thu, Apr 2, 2015 at 12:49 AM, Todd Nist tsind...@gmail.com wrote:

 I have a feeling I’m missing a Jar that provides the support or could
 this may be related to https://issues.apache.org/jira/browse/SPARK-5792.
 If it is a Jar where would I find that ? I would have thought in the
 $HIVE/lib folder, but not sure which jar contains it.

 Error:

 Create Metric Temporary Table for querying15/04/01 14:41:44 INFO 
 HiveMetaStore: 0: Opening raw store with implemenation 
 class:org.apache.hadoop.hive.metastore.ObjectStore15/04/01 14:41:44 INFO 
 ObjectStore: ObjectStore, initialize called15/04/01 14:41:45 INFO 
 Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be 
 ignored15/04/01 14:41:45 INFO Persistence: Property datanucleus.cache.level2 
 unknown - will be ignored15/04/01 14:41:45 INFO BlockManager: Removing 
 broadcast 015/04/01 14:41:45 INFO BlockManager: Removing block 
 broadcast_015/04/01 14:41:45 INFO MemoryStore: Block broadcast_0 of size 
 1272 dropped from memory (free 278018571)15/04/01 14:41:45 INFO 
 BlockManager: Removing block broadcast_0_piece015/04/01 14:41:45 INFO 
 MemoryStore: Block broadcast_0_piece0 of size 869 dropped from memory (free 
 278019440)15/04/01 14:41:45 INFO BlockManagerInfo: Removed 
 broadcast_0_piece0 on 192.168.1.5:63230 in memory (size: 869.0 B, free: 
 265.1 MB)15/04/01 14:41:45 INFO BlockManagerMaster: Updated info of block 
 broadcast_0_piece015/04/01 14:41:45 INFO BlockManagerInfo: Removed 
 broadcast_0_piece0 on 192.168.1.5:63278 in memory (size: 869.0 B, free: 
 530.0 MB)15/04/01 14:41:45 INFO ContextCleaner: Cleaned broadcast 015/04/01 
 14:41:46 INFO ObjectStore: Setting MetaStore object pin classes with 
 hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order15/04/01
  14:41:46 INFO Datastore: The class 
 org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as 
 embedded-only so does not have its own datastore table.15/04/01 14:41:46 
 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is 
 tagged as embedded-only so does not have its own datastore table.15/04/01 
 14:41:47 INFO Datastore: The class 
 org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as 
 embedded-only so does not have its own datastore table.15/04/01 14:41:47 
 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is 
 tagged as embedded-only so does not have its own datastore table.15/04/01 
 14:41:47 INFO Query: Reading in results for query 
 org.datanucleus.store.rdbms.query.SQLQuery@0 since the connection used is 
 closing15/04/01 14:41:47 INFO ObjectStore: Initialized ObjectStore15/04/01 
 14:41:47 INFO HiveMetaStore: Added admin role in metastore15/04/01 14:41:47 
 INFO HiveMetaStore

Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available

2015-04-02 Thread Todd Nist
I was trying a simple test from the spark-shell to see if 1.3.0 would
address a problem I was having with locating the json_tuple class and got
the following error:

scala> import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive._

scala> val sqlContext = new HiveContext(sc)
sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@79c849c7

scala> import sqlContext._
import sqlContext._

scala> case class MetricTable(path: String, pathElements: String, name: String, value: String)
scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature in
HiveMetastoreCatalog.class refers to term cache in package com.google.common
which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when
compiling HiveMetastoreCatalog.class.
That entry seems to have slain the compiler.  Shall I replay
your session? I can re-run each line except the last one.
[y/n]
Abandoning crashed session.

I entered the shell as follows:

./bin/spark-shell --master spark://radtech.io:7077
--total-executor-cores 2 --driver-class-path
/usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

hive-site.xml looks like this:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.semantic.analyzer.factory.impl</name>
    <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
  </property>

  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>false</value>
  </property>

  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>

  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>

  <property>
    <name>hive.warehouse.subdir.inherit.perms</name>
    <value>true</value>
  </property>

  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>***</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value></value>
  </property>
</configuration>

I have downloaded a clean version of 1.3.0 and tried it again, but I get the
same error. Is this a known issue? Or a configuration issue on my part?

TIA for the assistance.

-Todd


SparkSql - java.util.NoSuchElementException: key not found: node when access JSON Array

2015-03-31 Thread Todd Nist
I am accessing ElasticSearch via the elasticsearch-hadoop connector and
attempting to expose it via SparkSQL. I am using spark 1.2.1, the latest
supported by elasticsearch-hadoop, with "org.elasticsearch" %
"elasticsearch-hadoop" % "2.1.0.BUILD-SNAPSHOT". I'm
encountering an issue when I attempt to query the following json after
creating a temporary table from it. The json looks like this:

PUT /_template/device
{
  "template": "dev*",
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "metric": {
      "_timestamp" : {
        "enabled" : true,
        "stored" : true,
        "path" : "timestamp",
        "format" : "-MM-dd'T'HH:mm:ssZZ"
      },
      "properties": {
        "pathId": {
          "type": "string"
        },
        "pathElements": {
          "properties": {
            "node": {
              "type": "string"
            },
            "value": {
              "type": "string"
            }
          }
        },
        "name": {
          "type": "string"
        },
        "value": {
          "type": "double"
        },
        "timestamp": {
          "type": "date",
          "store": true
        }
      }
    }
  }
}

Querying all columns works fine except for pathElements, which is a json
array. If this is added to the select, it fails with a
java.util.NoSuchElementException:
key not found: node.

*Details*.

The program is pretty basic, looks like this:

/**
 * A simple sample to read and write to ES using elasticsearch-hadoop.
 */

package com.opsdatastore.elasticsearch.spark

import java.io.File

// Scala imports
import scala.collection.JavaConversions._
// Spark imports
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

import org.apache.spark.rdd.RDD

import org.apache.spark.sql.{SchemaRDD, SQLContext}

// ES imports
import org.elasticsearch.spark._
import org.elasticsearch.spark.sql._

// OpsDataStore
import com.opsdatastore.spark.utils.{Settings, Spark, ElasticSearch}

object ElasticSearchReadWrite {

  /**
   * Spark specific configuration
   */
  def sparkInit(): SparkContext = {
    val conf = new SparkConf().setAppName(Spark.AppName).setMaster(Spark.Master)
    conf.set("es.nodes", ElasticSearch.Nodes)
    conf.set("es.port", ElasticSearch.HttpPort.toString())
    conf.set("es.index.auto.create", "true")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    conf.set("spark.executor.memory", "1g")
    conf.set("spark.kryoserializer.buffer.mb", "256")

    val sparkContext = new SparkContext(conf)
    sparkContext.addJar(Spark.JarPath + "jar")
    sparkContext
  }


  def main(args: Array[String]) {

    val sc = sparkInit

    val sqlContext = new SQLContext(sc)
    import sqlContext._

    val start = System.currentTimeMillis()

    // specific query, just read all for now
    sc.esRDD(s"${ElasticSearch.Index}/${ElasticSearch.Type}", "?q=*:*")

    /*
     * Read from ES and provide some insight with Spark & SparkSQL
     */
    val esData = sc.esRDD("device/metric")

    esData.collect.foreach(println(_))

    val end = System.currentTimeMillis()
    println(s"Total time: ${end - start} ms")

    println("Create Metric Temporary Table for querying")
    val schemaRDD = sqlContext.sql(
      "CREATE TEMPORARY TABLE metric " +
      "USING org.elasticsearch.spark.sql " +
      "OPTIONS (resource 'device/metric')")

    System.out.println()
    System.out.println("#  Scheam Definition   #")
    System.out.println()
    schemaRDD.printSchema()

    System.out.println()
    System.out.println("#  Data from SparkSQL  #")
    System.out.println()

    sqlContext.sql("SELECT path, pathElements, `timestamp`, name, value FROM metric")
      .collect.foreach(println(_))
  }
}

So this works fine:

sc.esRDD("device/metric")
esData.collect.foreach(println(_))

And results in this:

15/03/31 14:37:48 INFO DAGScheduler: Job 0 finished: collect at
ElasticSearchReadWrite.scala:67, took 4.948556 s
(AUxxDrs4cgadF5SlaMg0,Map(pathElements -> Buffer(Map(node -> State,
value -> PA), Map(node -> City, value -> Pittsburgh), Map(node ->
Street, value -> 12345 Westbrook Drive), Map(node -> level, value ->
main), Map(node -> device, value -> thermostat)), value ->
29.590943279257175, name -> Current Temperature, timestamp ->
2015-03-27T14:53:46+, path -> /PA/Pittsburgh/12345 Westbrook
Drive/main/theromostat-1))

Yet this fails:

sqlContext.sql("SELECT path, pathElements, `timestamp`, name, value
FROM metric").collect.foreach(println(_))

With this exception:

Create Metric Temporary Table for querying
#  Scheam Definition   #
root
#  Data from SparkSQL  #
15/03/31 14:37:49 INFO BlockManager: Removing broadcast 0
15/03/31 14:37:49 INFO BlockManager: 

Re: Query REST web service with Spark?

2015-03-31 Thread Todd Nist
Here are a few ways to achieve what you're looking to do:

https://github.com/cjnolet/spark-jetty-server

Spark Job Server - https://github.com/spark-jobserver/spark-jobserver -

defines a REST API for Spark

Hue -

http://gethue.com/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/

Spark Kernel project: https://github.com/ibm-et/spark-kernel

 The Spark Kernel's goal is to serve as the foundation for interactive
 applications. The project provides a client library in Scala that abstracts
 connecting to the kernel (containing a Spark Context), which can be
 embedded into a web application. We demonstrated this at StataConf when we
 embedded the Spark Kernel client into a Play application to provide an
 interactive web application that communicates to Spark via the Spark Kernel
 (hosting a SparkContext).


Hopefully one of those will give you what you're looking for.

-Todd
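
(Aside, not part of the original reply: for the narrower question of issuing REST calls from inside a Spark job itself, a hedged sketch of a common pattern follows. The endpoint URL and the input data are hypothetical placeholders.)

// Hedged sketch: enrich records by calling a REST endpoint from within the job.
// The lookup keys and the URL are hypothetical stand-ins.
import scala.io.Source

val recordsRdd = sc.parallelize(Seq("id-1", "id-2"))      // stand-in for the real data

val enriched = recordsRdd.mapPartitions { keys =>
  keys.map { key =>
    val url = s"http://rest.example.com/lookup/$key"      // hypothetical service
    val body = Source.fromURL(url).mkString               // simple blocking HTTP GET
    (key, body)
  }
}
enriched.take(10).foreach(println)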

On Tue, Mar 31, 2015 at 5:06 PM, Burak Yavuz brk...@gmail.com wrote:

 Hi,

 If I recall correctly, I've read people integrating REST calls to Spark
 Streaming jobs in the user list. I don't imagine any cases for why it
 shouldn't be possible.

 Best,
 Burak

 On Tue, Mar 31, 2015 at 1:46 PM, Minnow Noir minnown...@gmail.com wrote:

 We have some data on Hadoop that needs to be augmented with data only
 available to us via a REST service.  We're using Spark to search for, and
 correct, missing data. Even though there are a lot of records to scour for
 missing data, the total number of calls to the service is expected to be
 low, so it would be ideal to do the whole job in Spark as we scour the data.

 I don't see anything obvious in the API or on Google relating to making
 REST calls from a Spark job.  Is it possible?

 Thanks,

 Alec





Re: SparkSql - java.util.NoSuchElementException: key not found: node when access JSON Array

2015-03-31 Thread Todd Nist
So in looking at this a bit more, I gather the root cause is the fact that
the nested fields are represented as rows within rows, is that correct?  If
I don't know the size of the json array (it varies), using
x.getAs[Row](0).getString(0) is not really a valid solution.

Is the solution to apply a lateral view + explode to this?

I have attempted to change to a lateral view, but looks like my syntax is
off:

sqlContext.sql("""
  SELECT path, `timestamp`, name, value, pe.value FROM metric
    lateral view explode(pathElements) a AS pe""")
  .collect.foreach(println(_))
Which results in:

15/03/31 17:38:34 INFO ContextCleaner: Cleaned broadcast 0
Exception in thread main java.lang.RuntimeException: [1.68] failure:
``UNION'' expected but identifier view found

SELECT path,`timestamp`, name, value, pe.value FROM metric lateral
view explode(pathElements) a AS pe
   ^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174)
at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31)
at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:303)
at com.opsdatastore.elasticsearch.spark.ElasticSearchReadWrite$.main(ElasticSearchReadWrite.scala:97)
at com.opsdatastore.elasticsearch.spark.ElasticSearchReadWrite.main(ElasticSearchReadWrite.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Is this the right approach?  Is this syntax available in 1.2.1:

SELECT
  v1.name, v2.city, v2.state
FROM people
  LATERAL VIEW json_tuple(people.jsonObject, 'name', 'address') v1
 as name, address
  LATERAL VIEW json_tuple(v1.address, 'city', 'state') v2
 as city, state;
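One more thing I may try is going through HiveContext rather than the plain
SQLContext, since as far as I can tell the basic SQL parser in 1.2.1 does not
understand LATERAL VIEW at all (which would explain the ``UNION'' expected
error). A rough sketch of what I have in mind, assuming the Spark build has
Hive support:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// register the ES-backed data as the metric table against hiveContext, same as before,
// then HiveQL should accept the lateral view:
hiveContext.sql(
  "SELECT path, `timestamp`, name, value, pe.value FROM metric " +
  "LATERAL VIEW explode(pathElements) a AS pe")
  .collect.foreach(println(_))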


-Todd

On Tue, Mar 31, 2015 at 3:26 PM, Todd Nist tsind...@gmail.com wrote:

 I am accessing ElasticSearch via the elasticsearch-hadoop connector and
 attempting to expose it via SparkSQL. I am using Spark 1.2.1, the latest
 supported by elasticsearch-hadoop, with "org.elasticsearch" %
 "elasticsearch-hadoop" % "2.1.0.BUILD-SNAPSHOT". I'm encountering an issue
 when I attempt to query the following JSON after creating a temporary table
 from it. The JSON looks like this:

 PUT /_template/device
 {
   template: dev

Re: Spark as a service

2015-03-24 Thread Todd Nist
Perhaps this project, https://github.com/calrissian/spark-jetty-server,
could help with your requirements.

On Tue, Mar 24, 2015 at 7:12 AM, Jeffrey Jedele jeffrey.jed...@gmail.com
wrote:

 I don't think there's a general approach to that - the use cases are just
 too different. If you really need it, you will probably have to implement
 it yourself in the driver of your application.
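 If you go that route, a very rough sketch of the shape it might take, using
 the JDK's built-in HttpServer purely for illustration (the port, the path,
 and the job it triggers are all placeholders):

 import java.net.InetSocketAddress
 import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}

 // run a tiny HTTP endpoint inside the driver; each request triggers a Spark action
 val server = HttpServer.create(new InetSocketAddress(8090), 0)
 server.createContext("/count", new HttpHandler {
   override def handle(exchange: HttpExchange): Unit = {
     val n = sc.textFile("/some/path").count()   // sc is the long-lived SparkContext in the driver
     val body = n.toString.getBytes("UTF-8")
     exchange.sendResponseHeaders(200, body.length)
     exchange.getResponseBody.write(body)
     exchange.getResponseBody.close()
   }
 })
 server.start()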

 PS: Make sure to use the reply to all button so that the mailing list is
 included in your reply. Otherwise only I will get your mail.

 Regards,
 Jeff

 2015-03-24 12:01 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com:

 Hi Jeffrey,

 Thanks. Yes, this resolves the SQL problem. My bad - I was looking for
 something which would work for Spark Streaming and other Spark jobs too,
 not just SQL.

 Regards,
 Ashish

 On Tue, Mar 24, 2015 at 4:07 PM, Jeffrey Jedele jeffrey.jed...@gmail.com
  wrote:

 Hi Ashish,
 this might be what you're looking for:


 https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server
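 Roughly, you start the Thrift server that ships with Spark and then talk to
 it from your web application over plain JDBC. A minimal sketch, where the
 host, port, table name, and credentials are placeholders:

 // started once on the cluster side, e.g.:
 //   ./sbin/start-thriftserver.sh --master spark://master-host:7077

 // then, from the web application, any HiveServer2-compatible JDBC client works:
 import java.sql.DriverManager

 Class.forName("org.apache.hive.jdbc.HiveDriver")
 val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "user", "")
 val stmt = conn.createStatement()
 val rs = stmt.executeQuery("SELECT count(*) FROM some_table")
 while (rs.next()) println(rs.getLong(1))
 conn.close()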

 Regards,
 Jeff

 2015-03-24 11:28 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com
 :

 Hello,

 As of now, if I have to execute a Spark job, I need to create a jar and
 deploy it. If I need to run dynamically formed SQL from a web application,
 is there any way of using SparkSQL in this manner? Perhaps through a web
 service or something similar.

 Regards,
 Ashish







Re: [SQL] Elasticsearch-hadoop, exception creating temporary table

2015-03-19 Thread Todd Nist
Thanks for the assistance. I found the error; it was something I had done -
PEBCAK. I had placed a copy of elasticsearch-hadoop 2.1.0.BETA3 in the
project's lib directory, causing it to be picked up as an unmanaged
dependency and brought in first, even though build.sbt had the correct
version specified, 2.1.0.BUILD-SNAPSHOT.

There was no reason for it to be there at all, and it's not something I
usually do.

Thanks again for pointing out that it was a version mismatch issue.

-Todd

On Wed, Mar 18, 2015 at 9:59 PM, Cheng, Hao hao.ch...@intel.com wrote:

  Todd, can you try running the code in the Spark shell (bin/spark-shell)?
 You may need to write some fake code to call the function in
 MappingUtils.scala. In the meantime, can you also check the jar dependency
 tree of your project, or the downloaded dependency jar files, just in case
 multiple versions of Spark have been introduced?
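 For example, from inside the shell you can check which jar a suspect class
 is actually loaded from; the class named below is just one I would expect to
 live in the elasticsearch-hadoop jar:

 // in bin/spark-shell: locate the jar a given class was loaded from
 val cls = Class.forName("org.elasticsearch.hadoop.cfg.ConfigurationOptions")
 println(cls.getProtectionDomain.getCodeSource.getLocation)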



 From: Todd Nist [mailto:tsind...@gmail.com]
 Sent: Thursday, March 19, 2015 9:04 AM
 To: Cheng, Hao
 Cc: user@spark.apache.org
 Subject: Re: [SQL] Elasticsearch-hadoop, exception creating temporary table



 Thanks for the quick response.

 The spark server is spark-1.2.1-bin-hadoop2.4 from the Spark download.
 Here is the startup:

 radtech$ ./sbin/start-master.sh

 starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark-1.2.1-bin-hadoop2.4/sbin/../logs/spark-tnist-org.apache.spark.deploy.master.Master-1-radtech.io.out

 Spark assembly has been built with Hive, including Datanucleus jars on classpath
 Spark Command: java -cp ::/usr/local/spark-1.2.1-bin-hadoop2.4/sbin/../conf:/usr/local/spark-1.2.1-bin-hadoop2.4/lib/spark-assembly-1.2.1-hadoop2.4.0.jar:/usr/local/spark-1.2.1-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark-1.2.1-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/usr/local/spark-1.2.1-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip radtech.io --port 7077 --webui-port 8080

 15/03/18 20:31:40 INFO Master: Registered signal handlers for [TERM, HUP, INT]
 15/03/18 20:31:40 INFO SecurityManager: Changing view acls to: tnist
 15/03/18 20:31:40 INFO SecurityManager: Changing modify acls to: tnist
 15/03/18 20:31:40 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(tnist); users with modify permissions: Set(tnist)
 15/03/18 20:31:41 INFO Slf4jLogger: Slf4jLogger started
 15/03/18 20:31:41 INFO Remoting: Starting remoting
 15/03/18 20:31:41 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkmas...@radtech.io:7077]
 15/03/18 20:31:41 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkmas...@radtech.io:7077]
 15/03/18 20:31:41 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
 15/03/18 20:31:41 INFO Master: Starting Spark master at spark://radtech.io:7077
 15/03/18 20:31:41 INFO Utils: Successfully started service 'MasterUI' on port 8080.
 15/03/18 20:31:41 INFO MasterWebUI: Started MasterWebUI at http://192.168.1.5:8080
 15/03/18 20:31:41 INFO Master: I have been elected leader! New state: ALIVE

  My build.sbt for the spark job is as follows:

 import AssemblyKeys._

 // activating assembly plugin
 assemblySettings

 name := "elasticsearch-spark"

 version := "0.0.1"

 val SCALA_VERSION = "2.10.4"

 val SPARK_VERSION = "1.2.1"

 val defaultSettings = Defaults.coreDefaultSettings ++ Seq(
   organization := "io.radtec",
   scalaVersion := SCALA_VERSION,
   resolvers := Seq(
     // "ods-repo" at "http://artifactory.ods:8082/artifactory/repo",
     Resolver.typesafeRepo("releases")),
   scalacOptions ++= Seq(
     "-unchecked",
     "-deprecation",
     "-Xlint",
     "-Ywarn-dead-code",
     "-language:_",
     "-target:jvm-1.7",
     "-encoding",
     "UTF-8"
   ),
   parallelExecution in Test := false,
   testOptions += Tests.Argument(TestFrameworks.JUnit, "-v"),
   publishArtifact in (Test, packageBin) := true,
   unmanagedSourceDirectories in Compile <<= (scalaSource in Compile)(Seq(_)),
   unmanagedSourceDirectories in Test <<= (scalaSource in Test)(Seq(_)),
   EclipseKeys.createSrc := EclipseCreateSrc.Default + EclipseCreateSrc.Resource,
   credentials += Credentials(Path.userHome / ".ivy2" / ".credentials"),
   publishTo := Some("Artifactory Realm" at "http://artifactory.ods:8082/artifactory/ivy-repo-local")
 )

 // custom Hadoop client, configured as provided, since it shouldn't go to the assembly jar
 val hadoopDeps = Seq(
   "org.apache.hadoop" % "hadoop-client" % "2.6.0" % "provided"
 )

 // ElasticSearch Hadoop support
 val esHadoopDeps = Seq(
   ("org.elasticsearch" % "elasticsearch-hadoop" % "2.1.0.BUILD-SNAPSHOT").
     exclude("org.apache.spark", "spark-core_2.10").
     exclude("org.apache.spark", "spark-streaming_2.10").
     exclude
