Re: Thanks For a Job Well Done !!!

2016-06-18 Thread Reynold Xin
Thanks for the kind words, Krishna! Please keep the feedback coming.

On Saturday, June 18, 2016, Krishna Sankar  wrote:

> Hi all,
>Just wanted to thank all for the dataset API - most of the times we see
> only bugs in these lists ;o).
>
>- Putting some context, this weekend I was updating the SQL chapters
>of my book - it had all the ugliness of SchemaRDD,
>registerTempTable, take(10).foreach(println)
>and take(30).foreach(e => println("%15s | %9.2f |".format(e(0), e(1)))) ;o)
>- I remember Hossein Falaki chiding me about the ugly println
>   statements !
>   - Took me a little while to grok the dataset, sparksession,
>   
> spark.read.option("header","true").option("inferSchema","true").csv(...) et
>   al.
>  - I am a big R fan and know the language pretty decent - so the
>  constructs are familiar
>   - Once I got it ( I am sure still there are more mysteries to
>uncover ...) it was just beautiful - well done folks !!!
>- One sees the contrast a lot better while teaching or writing books,
>because one has to think thru the old, the new and the transitional arc
>   - I even remember the good old days when we were discussing whether
>   Spark would get the dataframes like R at one of Paco's sessions !
>   - And now, it looks very decent for data wrangling.
>
> Cheers & keep up the good work
> 
> P.S: My next chapter is the MLlib - need to convert to ml. Should be
> interesting ... I am a glutton for punishment - of the Spark kind, of
> course !
>


Re: How to cause a stage to fail (using spark-shell)?

2016-06-18 Thread Burak Yavuz
Hi Jacek,

Can't you simply have a mapPartitions task throw an exception or something?
Are you trying to do something more esoteric?

Best,
Burak
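
A minimal sketch of that suggestion for spark-shell (the partition count and the
exception message are arbitrary). Once any task has failed spark.task.maxFailures
times (4 by default), the stage, and with it this single-stage job, is marked
failed in the web UI:

  val rdd = sc.parallelize(1 to 100, 4)
  rdd.mapPartitions { iter =>
    // every task in the resulting stage throws, so the stage fails after the retry limit
    throw new IllegalStateException("deliberate failure to exercise the failed-stage UI")
    iter // unreachable, but keeps the function's return type as Iterator[Int]
  }.count()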

On Sat, Jun 18, 2016 at 5:35 AM, Jacek Laskowski  wrote:

> Hi,
>
> Following up on this question, is a stage considered failed only when
> there is a FetchFailed exception? Can I have a failed stage with only
> a single-stage job?
>
> Appreciate any help on this...(as my family doesn't like me spending
> the weekend with Spark :))
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sat, Jun 18, 2016 at 11:53 AM, Jacek Laskowski  wrote:
> > Hi,
> >
> > I'm trying to see some stats about failing stages in web UI and want
> > to "create" few failed stages. Is this possible using spark-shell at
> > all? Which setup of Spark/spark-shell would allow for such a scenario.
> >
> > I could write a Scala code if that's the only way to have failing stages.
> >
> > Please guide. Thanks.
> >
> > /me on to reviewing the Spark code...
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > 
> > https://medium.com/@jaceklaskowski/
> > Mastering Apache Spark http://bit.ly/mastering-apache-spark
> > Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Thanks For a Job Well Done !!!

2016-06-18 Thread Krishna Sankar
Hi all,
   Just wanted to thank you all for the Dataset API - most of the time we see
only bugs on these lists ;o).

   - For some context, this weekend I was updating the SQL chapters of
   my book - it had all the ugliness of SchemaRDD, registerTempTable,
   take(10).foreach(println) and
   take(30).foreach(e => println("%15s | %9.2f |".format(e(0), e(1)))) ;o)
      - I remember Hossein Falaki chiding me about the ugly println
      statements!
   - It took me a little while to grok Dataset, SparkSession, and
   spark.read.option("header","true").option("inferSchema","true").csv(...)
   et al.
      - I am a big R fan and know the language pretty well, so the
      constructs are familiar.
   - Once I got it (I am sure there are still more mysteries to
   uncover ...) it was just beautiful - well done, folks!
   - One sees the contrast a lot better while teaching or writing books,
   because one has to think through the old, the new and the transitional arc.
      - I even remember the good old days when we were discussing, at one of
      Paco's sessions, whether Spark would get data frames like R!
      - And now, it looks very decent for data wrangling.

Cheers & keep up the good work

P.S.: My next chapter is MLlib - I need to convert to ml. Should be
interesting ... I am a glutton for punishment - of the Spark kind, of
course!
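
For readers who have not met the newer API the mail refers to, a minimal sketch
(the CSV path is a placeholder, and spark is the SparkSession available in the
Spark 2.0 shell):

  val df = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/path/to/data.csv")
  df.show(30) // a cleaner replacement for the old take(30).foreach(println(...)) formatting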


Re: Spark not using all the cluster instances in AWS EMR

2016-06-18 Thread Akhil Das
spark.executor.instances is the parameter that you are looking for. Read
more here http://spark.apache.org/docs/latest/running-on-yarn.html
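
A sketch of one way to apply that (the values are just the 3-node example from
this thread; with spark-submit the equivalent is the --num-executors flag, and
the cores and memory settings below are made-up placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  // request one executor per worker node of the 3-node EMR cluster
  val conf = new SparkConf()
    .setAppName("use-all-emr-nodes")
    .set("spark.executor.instances", "3")
    .set("spark.executor.cores", "4")   // size to the instance type
    .set("spark.executor.memory", "4g") // size to the instance type
  val sc = new SparkContext(conf)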

On Sun, Jun 19, 2016 at 2:17 AM, Natu Lauchande 
wrote:

> Hi,
>
> I am running some spark loads . I notice that in  it only uses one of the
> machines(instead of the 3 available) of the cluster.
>
> Is there any parameter that can be set to force it to use all the cluster.
>
> I am using AWS EMR with Yarn.
>
>
> Thanks,
> Natu


-- 
Cheers!


Re: spark streaming - how to purge old data files in data directory

2016-06-18 Thread Akhil Das
Currently, there is no out-of-the-box solution for this, but you can
use HDFS utilities to remove older files from the directory (say, files more
than 24 hours old). Another approach is discussed here.
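
A rough sketch of that clean-up using the Hadoop FileSystem API, run from the
driver or a scheduled job rather than from Spark Streaming itself (the input
path is a placeholder and the 24-hour cutoff matches the example above):

  import org.apache.hadoop.fs.{FileSystem, Path}

  val fs = FileSystem.get(sc.hadoopConfiguration)  // sc is the SparkContext
  val inputDir = new Path("/data/streaming/input") // hypothetical streaming input directory
  val cutoff = System.currentTimeMillis() - 24L * 60 * 60 * 1000
  fs.listStatus(inputDir)
    .filter(s => s.isFile && s.getModificationTime < cutoff) // files older than 24 hours
    .foreach(s => fs.delete(s.getPath, false))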

On Sun, Jun 19, 2016 at 7:28 AM, Vamsi Krishna 
wrote:

> Hi,
>
> I'm on HDP 2.3.2 cluster (Spark 1.4.1).
> I have a spark streaming app which uses 'textFileStream' to stream simple
> CSV files and process.
> I see the old data files that are processed are left in the data directory.
> What is the right way to purge the old data files in data directory on
> HDFS?
>
> Thanks,
> Vamsi Attluri
> --
> Vamsi Attluri
>



-- 
Cheers!


Re: Running JavaBased Implementation of StreamingKmeans

2016-06-18 Thread Akhil Das
Spark Streaming does not pick up old files by default, so you need to start
your job with master=local[2] (it needs 2 or more worker threads: 1 to
read the files and the other to do your computation). Once the job starts
running, place your input files in the input directories and you will see
them being picked up by Spark Streaming.
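
A minimal sketch of that point (batch interval and paths are illustrative): with
textFileStream, only files that appear in the directory after the context has
started are picked up, and local mode needs at least two threads:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-kmeans-input-check")
  val ssc = new StreamingContext(conf, Seconds(10))
  val training = ssc.textFileStream("D:/spark/streaming example/Data Sets/training")
  training.count().print() // a non-zero count confirms new files are being read
  ssc.start()
  ssc.awaitTermination()   // now copy files into the directory and watch the output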

On Sun, Jun 19, 2016 at 12:37 AM, Biplob Biswas 
wrote:

> Hi,
>
> I tried local[*] and local[2] and the result is the same. I don't really
> understand the problem here.
> How can I confirm that the files are read properly?
>
> Thanks & Regards
> Biplob Biswas
>
> On Sat, Jun 18, 2016 at 5:59 PM, Akhil Das  wrote:
>
>> Looks like you need to set your master to local[2] or local[*]
>>
>> On Sat, Jun 18, 2016 at 4:54 PM, Biplob Biswas 
>> wrote:
>>
>>> Hi,
>>>
>>> I implemented the streamingKmeans example provided in the spark website
>>> but
>>> in Java.
>>> The full implementation is here,
>>>
>>> http://pastebin.com/CJQfWNvk
>>>
>>> But i am not getting anything in the output except occasional timestamps
>>> like one below:
>>>
>>> ---
>>> Time: 1466176935000 ms
>>> ---
>>>
>>> Also, i have 2 directories:
>>> "D:\spark\streaming example\Data Sets\training"
>>> "D:\spark\streaming example\Data Sets\test"
>>>
>>> and inside these directories i have 1 file each "samplegpsdata_train.txt"
>>> and "samplegpsdata_test.txt" with training data having 500 datapoints and
>>> test data with 60 datapoints.
>>>
>>> I am very new to the spark systems and any help is highly appreciated.
>>>
>>> Thank you so much
>>> Biplob Biswas
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Running-JavaBased-Implementationof-StreamingKmeans-tp27192.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>>
>> --
>> Cheers!
>>
>>
>


-- 
Cheers!


spark streaming - how to purge old data files in data directory

2016-06-18 Thread Vamsi Krishna
Hi,

I'm on HDP 2.3.2 cluster (Spark 1.4.1).
I have a Spark Streaming app which uses 'textFileStream' to stream simple
CSV files and process them.
I see that the old data files that have been processed are left in the data
directory.
What is the right way to purge the old data files in the data directory on HDFS?

Thanks,
Vamsi Attluri
-- 
Vamsi Attluri


Re: Creating tables for JSON data

2016-06-18 Thread brendan kehoe
Hello

I downloaded Apache Spark pre-built for Hadoop 2.6.
When I create a table, an empty directory with the same name is created
in /user/hive/warehouse. I created tables with the following kind of
statement:

> create table aTable (aColumn string)


When I place text files in the directory, e.g.
/user/hive/warehouse/atable/text-file, I can query the contents with
"select * from aTable", for example. When I create a table with the
following, I can only query the specified file (/path/to/json/file):

> CREATE TABLE jsonTable USING org.apache.spark.sql.json OPTIONS ( path
> "/path/to/json/file" )


A directory, e.g. /user/hive/warehouse/jsontable, is created, but if I put
files in there, queries do not access the contents of those files. Is this
related to managed versus external tables, or what explains it?

Tables created with USING org.apache.spark.sql.json... are external tables,
and tables created by specifying columns are managed. How do you make a
managed table in the same way the external tables are created above, i.e.
without specifying columns and instead inferring the columns from the JSON
content? I would expect queries on the managed table to give access to data
in files after those files are put in the managed table directory, as I have
seen on the managed tables I have created so far.

Thanks very much

Brendan
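
Not an authoritative answer, but one sketch of getting a managed table whose
columns are inferred from JSON in Spark 1.x (assuming sqlContext is the
HiveContext that the pre-built-with-Hive shell provides, and "jsonManaged" is a
hypothetical table name). Note that saveAsTable writes the data in the default
data source format (Parquet in 1.x), so raw files dropped into the resulting
warehouse directory afterwards will not simply be picked up either:

  val df = sqlContext.read.json("/path/to/json/file")
  df.printSchema()                    // columns inferred from the JSON content
  df.write.saveAsTable("jsonManaged") // creates a managed table in the metastore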


Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
Hi Mich again,

Regarding batch window, etc. I have provided the sources, but I'm not
currently calling the window function. Did you see the program source?
It's only 100 lines.

https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877

Then I would expect I'm using defaults, other than what has been shown
in the configuration.

For example:

In the launcher configuration I set --conf
spark.streaming.kafka.maxRatePerPartition=500 \ and I believe there
are 500 messages for the duration set in the application:
JavaStreamingContext jssc = new JavaStreamingContext(jsc, new
Duration(1000));


Then, with the --num-executors 6 submit flag and
spark.streaming.kafka.maxRatePerPartition=500, I think that's how we
arrive at the 3000 events per batch shown in the UI output pasted above.

Feel free to correct me if I'm wrong.
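
For what it's worth, a small sketch of the arithmetic that is usually quoted for
the direct stream's per-batch cap (the figures are the ones from this thread):

  // max events per batch = maxRatePerPartition * number of Kafka partitions * batch seconds
  val maxRatePerPartition = 500 // spark.streaming.kafka.maxRatePerPartition
  val kafkaPartitions     = 6
  val batchSeconds        = 1   // Duration(1000)
  val maxEventsPerBatch   = maxRatePerPartition * kafkaPartitions * batchSeconds // = 3000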

Then are you suggesting that I set the window?

Maybe following this as reference:

https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/windows.html
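
Purely as an illustration of the window operation being discussed (the broker
list and topic echo the launch arguments, and the window sizes are arbitrary;
this is not taken from the gist), a Spark 1.5-style direct stream with a
30-second window sliding every 10 seconds. Both values must be multiples of the
batch duration:

  import kafka.serializer.StringDecoder
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  val conf = new SparkConf().setAppName("kafka-window-sketch")
  val ssc  = new StreamingContext(conf, Seconds(1)) // 1-second batches, as in the app
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("FromTable"))
  val windowed = stream.map(_._2).window(Seconds(30), Seconds(10))
  windowed.count().print()
  ssc.start()
  ssc.awaitTermination()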

On Sat, Jun 18, 2016 at 8:08 PM, Mich Talebzadeh
 wrote:
> Ok
>
> What is the set up for these please?
>
> batch window
> window length
> sliding interval
>
> And also in each batch window how much data do you get in (no of messages in
> the topic whatever)?
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
> On 18 June 2016 at 21:01, Mich Talebzadeh  wrote:
>>
>> I believe you have an issue with performance?
>>
>> have you checked spark GUI (default 4040) for details including shuffles
>> etc?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>> On 18 June 2016 at 20:59, Colin Kincaid Williams  wrote:
>>>
>>> There are 25 nodes in the spark cluster.
>>>
>>> On Sat, Jun 18, 2016 at 7:53 PM, Mich Talebzadeh
>>>  wrote:
>>> > how many nodes are in your cluster?
>>> >
>>> > --num-executors 6 \
>>> >  --driver-memory 4G \
>>> >  --executor-memory 2G \
>>> >  --total-executor-cores 12 \
>>> >
>>> >
>>> > Dr Mich Talebzadeh
>>> >
>>> >
>>> >
>>> > LinkedIn
>>> >
>>> > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> >
>>> >
>>> >
>>> > http://talebzadehmich.wordpress.com
>>> >
>>> >
>>> >
>>> >
>>> > On 18 June 2016 at 20:40, Colin Kincaid Williams 
>>> > wrote:
>>> >>
>>> >> I updated my app to Spark 1.5.2 streaming so that it consumes from
>>> >> Kafka using the direct api and inserts content into an hbase cluster,
>>> >> as described in this thread. I was away from this project for awhile
>>> >> due to events in my family.
>>> >>
>>> >> Currently my scheduling delay is high, but the processing time is
>>> >> stable around a second. I changed my setup to use 6 kafka partitions
>>> >> on a set of smaller kafka brokers, with fewer disks. I've included
>>> >> some details below, including the script I use to launch the
>>> >> application. I'm using a Spark on Hbase library, whose version is
>>> >> relevant to my Hbase cluster. Is it apparent there is something wrong
>>> >> with my launch method that could be causing the delay, related to the
>>> >> included jars?
>>> >>
>>> >> Or is there something wrong with the very simple approach I'm taking
>>> >> for the application?
>>> >>
>>> >> Any advice is appriciated.
>>> >>
>>> >>
>>> >> The application:
>>> >>
>>> >> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>>> >>
>>> >>
>>> >> From the streaming UI I get something like:
>>> >>
>>> >> table Completed Batches (last 1000 out of 27136)
>>> >>
>>> >>
>>> >> Batch Time Input Size Scheduling Delay (?) Processing Time (?) Total
>>> >> Delay (?) Output Ops: Succeeded/Total
>>> >>
>>> >> 2016/06/18 11:21:32 3000 events 1.2 h 1 s 1.2 h 1/1
>>> >>
>>> >> 2016/06/18 11:21:31 3000 events 1.2 h 1 s 1.2 h 1/1
>>> >>
>>> >> 2016/06/18 11:21:30 3000 events 1.2 h 1 s 1.2 h 1/1
>>> >>
>>> >>
>>> >> Here's how I'm launching the spark application.
>>> >>
>>> >>
>>> >> #!/usr/bin/env bash
>>> >>
>>> >> export SPARK_CONF_DIR=/home/colin.williams/spark
>>> >>
>>> >> export HADOOP_CONF_DIR=/etc/hadoop/conf
>>> >>
>>> >> export
>>> >>
>>> >> HADOOP_CLASSPATH=/home/colin.williams/hbase/conf/:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/hbase-protocol-0.98.6-cdh5.3.0.jar
>>> >>
>>> >>
>>> >> /opt/spark-1.5.2-bin-hadoop2.4/bin/spark-submit \
>>> >>
>>> >> --class com.example.KafkaToHbase \
>>> >>
>>> >> --master spark://spark_master:7077 \
>>> >>
>>> >> --deploy-mode client \
>>> >>
>>> >> --num-executors 6 \
>>> >>
>>> >> --driver-memory 4G \
>>> >>
>>> >> --executor-memory 2G \
>>> >>
>>> >> 

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Mich Talebzadeh
Ok

What is the setup for these, please?

batch window
window length
sliding interval

Also, in each batch window, how much data do you get in (number of messages
in the topic, or whatever the measure is)?




Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 18 June 2016 at 21:01, Mich Talebzadeh  wrote:

> I believe you have an issue with performance?
>
> have you checked spark GUI (default 4040) for details including shuffles
> etc?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 18 June 2016 at 20:59, Colin Kincaid Williams  wrote:
>
>> There are 25 nodes in the spark cluster.
>>
>> On Sat, Jun 18, 2016 at 7:53 PM, Mich Talebzadeh
>>  wrote:
>> > how many nodes are in your cluster?
>> >
>> > --num-executors 6 \
>> >  --driver-memory 4G \
>> >  --executor-memory 2G \
>> >  --total-executor-cores 12 \
>> >
>> >
>> > Dr Mich Talebzadeh
>> >
>> >
>> >
>> > LinkedIn
>> >
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >
>> >
>> >
>> > http://talebzadehmich.wordpress.com
>> >
>> >
>> >
>> >
>> > On 18 June 2016 at 20:40, Colin Kincaid Williams 
>> wrote:
>> >>
>> >> I updated my app to Spark 1.5.2 streaming so that it consumes from
>> >> Kafka using the direct api and inserts content into an hbase cluster,
>> >> as described in this thread. I was away from this project for awhile
>> >> due to events in my family.
>> >>
>> >> Currently my scheduling delay is high, but the processing time is
>> >> stable around a second. I changed my setup to use 6 kafka partitions
>> >> on a set of smaller kafka brokers, with fewer disks. I've included
>> >> some details below, including the script I use to launch the
>> >> application. I'm using a Spark on Hbase library, whose version is
>> >> relevant to my Hbase cluster. Is it apparent there is something wrong
>> >> with my launch method that could be causing the delay, related to the
>> >> included jars?
>> >>
>> >> Or is there something wrong with the very simple approach I'm taking
>> >> for the application?
>> >>
>> >> Any advice is appriciated.
>> >>
>> >>
>> >> The application:
>> >>
>> >> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>> >>
>> >>
>> >> From the streaming UI I get something like:
>> >>
>> >> table Completed Batches (last 1000 out of 27136)
>> >>
>> >>
>> >> Batch Time Input Size Scheduling Delay (?) Processing Time (?) Total
>> >> Delay (?) Output Ops: Succeeded/Total
>> >>
>> >> 2016/06/18 11:21:32 3000 events 1.2 h 1 s 1.2 h 1/1
>> >>
>> >> 2016/06/18 11:21:31 3000 events 1.2 h 1 s 1.2 h 1/1
>> >>
>> >> 2016/06/18 11:21:30 3000 events 1.2 h 1 s 1.2 h 1/1
>> >>
>> >>
>> >> Here's how I'm launching the spark application.
>> >>
>> >>
>> >> #!/usr/bin/env bash
>> >>
>> >> export SPARK_CONF_DIR=/home/colin.williams/spark
>> >>
>> >> export HADOOP_CONF_DIR=/etc/hadoop/conf
>> >>
>> >> export
>> >>
>> HADOOP_CLASSPATH=/home/colin.williams/hbase/conf/:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/hbase-protocol-0.98.6-cdh5.3.0.jar
>> >>
>> >>
>> >> /opt/spark-1.5.2-bin-hadoop2.4/bin/spark-submit \
>> >>
>> >> --class com.example.KafkaToHbase \
>> >>
>> >> --master spark://spark_master:7077 \
>> >>
>> >> --deploy-mode client \
>> >>
>> >> --num-executors 6 \
>> >>
>> >> --driver-memory 4G \
>> >>
>> >> --executor-memory 2G \
>> >>
>> >> --total-executor-cores 12 \
>> >>
>> >> --jars
>> >>
>> /home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/zookeeper/zookeeper-3.4.5-cdh5.3.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/guava-12.0.1.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/protobuf-java-2.5.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-protocol.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-client.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-common.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop2-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-server.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/htrace-core.jar
>> >> \
>> >>
>> >> --conf spark.app.name="Kafka To Hbase" \
>> >>
>> >> --conf spark.eventLog.dir="hdfs:///user/spark/applicationHistory" \
>> >>
>> >> --conf spark.eventLog.enabled=false \
>> >>
>> >> --conf spark.eventLog.overwrite=true \
>> >>
>> >> --conf 

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Mich Talebzadeh
I believe you have an issue with performance?

Have you checked the Spark GUI (default port 4040) for details, including
shuffles etc.?

HTH

Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 18 June 2016 at 20:59, Colin Kincaid Williams  wrote:

> There are 25 nodes in the spark cluster.
>
> On Sat, Jun 18, 2016 at 7:53 PM, Mich Talebzadeh
>  wrote:
> > how many nodes are in your cluster?
> >
> > --num-executors 6 \
> >  --driver-memory 4G \
> >  --executor-memory 2G \
> >  --total-executor-cores 12 \
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn
> >
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> >
> >
> > On 18 June 2016 at 20:40, Colin Kincaid Williams  wrote:
> >>
> >> I updated my app to Spark 1.5.2 streaming so that it consumes from
> >> Kafka using the direct api and inserts content into an hbase cluster,
> >> as described in this thread. I was away from this project for awhile
> >> due to events in my family.
> >>
> >> Currently my scheduling delay is high, but the processing time is
> >> stable around a second. I changed my setup to use 6 kafka partitions
> >> on a set of smaller kafka brokers, with fewer disks. I've included
> >> some details below, including the script I use to launch the
> >> application. I'm using a Spark on Hbase library, whose version is
> >> relevant to my Hbase cluster. Is it apparent there is something wrong
> >> with my launch method that could be causing the delay, related to the
> >> included jars?
> >>
> >> Or is there something wrong with the very simple approach I'm taking
> >> for the application?
> >>
> >> Any advice is appriciated.
> >>
> >>
> >> The application:
> >>
> >> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
> >>
> >>
> >> From the streaming UI I get something like:
> >>
> >> table Completed Batches (last 1000 out of 27136)
> >>
> >>
> >> Batch Time Input Size Scheduling Delay (?) Processing Time (?) Total
> >> Delay (?) Output Ops: Succeeded/Total
> >>
> >> 2016/06/18 11:21:32 3000 events 1.2 h 1 s 1.2 h 1/1
> >>
> >> 2016/06/18 11:21:31 3000 events 1.2 h 1 s 1.2 h 1/1
> >>
> >> 2016/06/18 11:21:30 3000 events 1.2 h 1 s 1.2 h 1/1
> >>
> >>
> >> Here's how I'm launching the spark application.
> >>
> >>
> >> #!/usr/bin/env bash
> >>
> >> export SPARK_CONF_DIR=/home/colin.williams/spark
> >>
> >> export HADOOP_CONF_DIR=/etc/hadoop/conf
> >>
> >> export
> >>
> HADOOP_CLASSPATH=/home/colin.williams/hbase/conf/:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/hbase-protocol-0.98.6-cdh5.3.0.jar
> >>
> >>
> >> /opt/spark-1.5.2-bin-hadoop2.4/bin/spark-submit \
> >>
> >> --class com.example.KafkaToHbase \
> >>
> >> --master spark://spark_master:7077 \
> >>
> >> --deploy-mode client \
> >>
> >> --num-executors 6 \
> >>
> >> --driver-memory 4G \
> >>
> >> --executor-memory 2G \
> >>
> >> --total-executor-cores 12 \
> >>
> >> --jars
> >>
> /home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/zookeeper/zookeeper-3.4.5-cdh5.3.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/guava-12.0.1.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/protobuf-java-2.5.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-protocol.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-client.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-common.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop2-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-server.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/htrace-core.jar
> >> \
> >>
> >> --conf spark.app.name="Kafka To Hbase" \
> >>
> >> --conf spark.eventLog.dir="hdfs:///user/spark/applicationHistory" \
> >>
> >> --conf spark.eventLog.enabled=false \
> >>
> >> --conf spark.eventLog.overwrite=true \
> >>
> >> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> >>
> >> --conf spark.streaming.backpressure.enabled=false \
> >>
> >> --conf spark.streaming.kafka.maxRatePerPartition=500 \
> >>
> >> --driver-class-path /home/colin.williams/kafka-hbase.jar \
> >>
> >> --driver-java-options
> >>
> >>
> -Dspark.executor.extraClassPath=/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*
> >> \
> >>
> >> /home/colin.williams/kafka-hbase.jar "FromTable" "ToTable"
> >> "broker1:9092,broker2:9092"
> >>
> >> On Tue, May 3, 2016 at 8:20 PM, Colin Kincaid Williams 
> >> wrote:
> >> > Thanks Cody, I can see that the partitions are well 

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
I'm attaching a picture from the streaming UI.

On Sat, Jun 18, 2016 at 7:59 PM, Colin Kincaid Williams  wrote:
> There are 25 nodes in the spark cluster.
>
> On Sat, Jun 18, 2016 at 7:53 PM, Mich Talebzadeh
>  wrote:
>> how many nodes are in your cluster?
>>
>> --num-executors 6 \
>>  --driver-memory 4G \
>>  --executor-memory 2G \
>>  --total-executor-cores 12 \
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>> On 18 June 2016 at 20:40, Colin Kincaid Williams  wrote:
>>>
>>> I updated my app to Spark 1.5.2 streaming so that it consumes from
>>> Kafka using the direct api and inserts content into an hbase cluster,
>>> as described in this thread. I was away from this project for awhile
>>> due to events in my family.
>>>
>>> Currently my scheduling delay is high, but the processing time is
>>> stable around a second. I changed my setup to use 6 kafka partitions
>>> on a set of smaller kafka brokers, with fewer disks. I've included
>>> some details below, including the script I use to launch the
>>> application. I'm using a Spark on Hbase library, whose version is
>>> relevant to my Hbase cluster. Is it apparent there is something wrong
>>> with my launch method that could be causing the delay, related to the
>>> included jars?
>>>
>>> Or is there something wrong with the very simple approach I'm taking
>>> for the application?
>>>
>>> Any advice is appriciated.
>>>
>>>
>>> The application:
>>>
>>> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>>>
>>>
>>> From the streaming UI I get something like:
>>>
>>> table Completed Batches (last 1000 out of 27136)
>>>
>>>
>>> Batch Time Input Size Scheduling Delay (?) Processing Time (?) Total
>>> Delay (?) Output Ops: Succeeded/Total
>>>
>>> 2016/06/18 11:21:32 3000 events 1.2 h 1 s 1.2 h 1/1
>>>
>>> 2016/06/18 11:21:31 3000 events 1.2 h 1 s 1.2 h 1/1
>>>
>>> 2016/06/18 11:21:30 3000 events 1.2 h 1 s 1.2 h 1/1
>>>
>>>
>>> Here's how I'm launching the spark application.
>>>
>>>
>>> #!/usr/bin/env bash
>>>
>>> export SPARK_CONF_DIR=/home/colin.williams/spark
>>>
>>> export HADOOP_CONF_DIR=/etc/hadoop/conf
>>>
>>> export
>>> HADOOP_CLASSPATH=/home/colin.williams/hbase/conf/:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/hbase-protocol-0.98.6-cdh5.3.0.jar
>>>
>>>
>>> /opt/spark-1.5.2-bin-hadoop2.4/bin/spark-submit \
>>>
>>> --class com.example.KafkaToHbase \
>>>
>>> --master spark://spark_master:7077 \
>>>
>>> --deploy-mode client \
>>>
>>> --num-executors 6 \
>>>
>>> --driver-memory 4G \
>>>
>>> --executor-memory 2G \
>>>
>>> --total-executor-cores 12 \
>>>
>>> --jars
>>> /home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/zookeeper/zookeeper-3.4.5-cdh5.3.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/guava-12.0.1.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/protobuf-java-2.5.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-protocol.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-client.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-common.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop2-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-server.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/htrace-core.jar
>>> \
>>>
>>> --conf spark.app.name="Kafka To Hbase" \
>>>
>>> --conf spark.eventLog.dir="hdfs:///user/spark/applicationHistory" \
>>>
>>> --conf spark.eventLog.enabled=false \
>>>
>>> --conf spark.eventLog.overwrite=true \
>>>
>>> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>>>
>>> --conf spark.streaming.backpressure.enabled=false \
>>>
>>> --conf spark.streaming.kafka.maxRatePerPartition=500 \
>>>
>>> --driver-class-path /home/colin.williams/kafka-hbase.jar \
>>>
>>> --driver-java-options
>>>
>>> -Dspark.executor.extraClassPath=/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*
>>> \
>>>
>>> /home/colin.williams/kafka-hbase.jar "FromTable" "ToTable"
>>> "broker1:9092,broker2:9092"
>>>
>>> On Tue, May 3, 2016 at 8:20 PM, Colin Kincaid Williams 
>>> wrote:
>>> > Thanks Cody, I can see that the partitions are well distributed...
>>> > Then I'm in the process of using the direct api.
>>> >
>>> > On Tue, May 3, 2016 at 6:51 PM, Cody Koeninger 
>>> > wrote:
>>> >> 60 partitions in and of itself shouldn't be a big performance issue
>>> >> (as long as producers are distributing across partitions evenly).
>>> >>
>>> >> On Tue, May 3, 2016 at 1:44 PM, Colin Kincaid Williams 
>>> >> wrote:
>>> >>> Thanks again Cody. Regarding the details 

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
There are 25 nodes in the spark cluster.

On Sat, Jun 18, 2016 at 7:53 PM, Mich Talebzadeh
 wrote:
> how many nodes are in your cluster?
>
> --num-executors 6 \
>  --driver-memory 4G \
>  --executor-memory 2G \
>  --total-executor-cores 12 \
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
> On 18 June 2016 at 20:40, Colin Kincaid Williams  wrote:
>>
>> I updated my app to Spark 1.5.2 streaming so that it consumes from
>> Kafka using the direct api and inserts content into an hbase cluster,
>> as described in this thread. I was away from this project for awhile
>> due to events in my family.
>>
>> Currently my scheduling delay is high, but the processing time is
>> stable around a second. I changed my setup to use 6 kafka partitions
>> on a set of smaller kafka brokers, with fewer disks. I've included
>> some details below, including the script I use to launch the
>> application. I'm using a Spark on Hbase library, whose version is
>> relevant to my Hbase cluster. Is it apparent there is something wrong
>> with my launch method that could be causing the delay, related to the
>> included jars?
>>
>> Or is there something wrong with the very simple approach I'm taking
>> for the application?
>>
>> Any advice is appriciated.
>>
>>
>> The application:
>>
>> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>>
>>
>> From the streaming UI I get something like:
>>
>> table Completed Batches (last 1000 out of 27136)
>>
>>
>> Batch Time Input Size Scheduling Delay (?) Processing Time (?) Total
>> Delay (?) Output Ops: Succeeded/Total
>>
>> 2016/06/18 11:21:32 3000 events 1.2 h 1 s 1.2 h 1/1
>>
>> 2016/06/18 11:21:31 3000 events 1.2 h 1 s 1.2 h 1/1
>>
>> 2016/06/18 11:21:30 3000 events 1.2 h 1 s 1.2 h 1/1
>>
>>
>> Here's how I'm launching the spark application.
>>
>>
>> #!/usr/bin/env bash
>>
>> export SPARK_CONF_DIR=/home/colin.williams/spark
>>
>> export HADOOP_CONF_DIR=/etc/hadoop/conf
>>
>> export
>> HADOOP_CLASSPATH=/home/colin.williams/hbase/conf/:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/hbase-protocol-0.98.6-cdh5.3.0.jar
>>
>>
>> /opt/spark-1.5.2-bin-hadoop2.4/bin/spark-submit \
>>
>> --class com.example.KafkaToHbase \
>>
>> --master spark://spark_master:7077 \
>>
>> --deploy-mode client \
>>
>> --num-executors 6 \
>>
>> --driver-memory 4G \
>>
>> --executor-memory 2G \
>>
>> --total-executor-cores 12 \
>>
>> --jars
>> /home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/zookeeper/zookeeper-3.4.5-cdh5.3.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/guava-12.0.1.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/protobuf-java-2.5.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-protocol.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-client.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-common.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop2-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-server.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/htrace-core.jar
>> \
>>
>> --conf spark.app.name="Kafka To Hbase" \
>>
>> --conf spark.eventLog.dir="hdfs:///user/spark/applicationHistory" \
>>
>> --conf spark.eventLog.enabled=false \
>>
>> --conf spark.eventLog.overwrite=true \
>>
>> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>>
>> --conf spark.streaming.backpressure.enabled=false \
>>
>> --conf spark.streaming.kafka.maxRatePerPartition=500 \
>>
>> --driver-class-path /home/colin.williams/kafka-hbase.jar \
>>
>> --driver-java-options
>>
>> -Dspark.executor.extraClassPath=/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*
>> \
>>
>> /home/colin.williams/kafka-hbase.jar "FromTable" "ToTable"
>> "broker1:9092,broker2:9092"
>>
>> On Tue, May 3, 2016 at 8:20 PM, Colin Kincaid Williams 
>> wrote:
>> > Thanks Cody, I can see that the partitions are well distributed...
>> > Then I'm in the process of using the direct api.
>> >
>> > On Tue, May 3, 2016 at 6:51 PM, Cody Koeninger 
>> > wrote:
>> >> 60 partitions in and of itself shouldn't be a big performance issue
>> >> (as long as producers are distributing across partitions evenly).
>> >>
>> >> On Tue, May 3, 2016 at 1:44 PM, Colin Kincaid Williams 
>> >> wrote:
>> >>> Thanks again Cody. Regarding the details 66 kafka partitions on 3
>> >>> kafka servers, likely 8 core systems with 10 disks each. Maybe the
>> >>> issue with the receiver was the large number of partitions. I had
>> >>> miscounted the disks and so 11*3*2 is how I decided to partition my
>> >>> topic on 

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Mich Talebzadeh
how many nodes are in your cluster?

--num-executors 6 \
 --driver-memory 4G \
 --executor-memory 2G \
 --total-executor-cores 12 \


Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 18 June 2016 at 20:40, Colin Kincaid Williams  wrote:

> I updated my app to Spark 1.5.2 streaming so that it consumes from
> Kafka using the direct api and inserts content into an hbase cluster,
> as described in this thread. I was away from this project for awhile
> due to events in my family.
>
> Currently my scheduling delay is high, but the processing time is
> stable around a second. I changed my setup to use 6 kafka partitions
> on a set of smaller kafka brokers, with fewer disks. I've included
> some details below, including the script I use to launch the
> application. I'm using a Spark on Hbase library, whose version is
> relevant to my Hbase cluster. Is it apparent there is something wrong
> with my launch method that could be causing the delay, related to the
> included jars?
>
> Or is there something wrong with the very simple approach I'm taking
> for the application?
>
> Any advice is appriciated.
>
>
> The application:
>
> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>
>
> From the streaming UI I get something like:
>
> table Completed Batches (last 1000 out of 27136)
>
>
> Batch Time Input Size Scheduling Delay (?) Processing Time (?) Total
> Delay (?) Output Ops: Succeeded/Total
>
> 2016/06/18 11:21:32 3000 events 1.2 h 1 s 1.2 h 1/1
>
> 2016/06/18 11:21:31 3000 events 1.2 h 1 s 1.2 h 1/1
>
> 2016/06/18 11:21:30 3000 events 1.2 h 1 s 1.2 h 1/1
>
>
> Here's how I'm launching the spark application.
>
>
> #!/usr/bin/env bash
>
> export SPARK_CONF_DIR=/home/colin.williams/spark
>
> export HADOOP_CONF_DIR=/etc/hadoop/conf
>
> export
> HADOOP_CLASSPATH=/home/colin.williams/hbase/conf/:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/hbase-protocol-0.98.6-cdh5.3.0.jar
>
>
> /opt/spark-1.5.2-bin-hadoop2.4/bin/spark-submit \
>
> --class com.example.KafkaToHbase \
>
> --master spark://spark_master:7077 \
>
> --deploy-mode client \
>
> --num-executors 6 \
>
> --driver-memory 4G \
>
> --executor-memory 2G \
>
> --total-executor-cores 12 \
>
> --jars
> /home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/zookeeper/zookeeper-3.4.5-cdh5.3.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/guava-12.0.1.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/protobuf-java-2.5.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-protocol.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-client.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-common.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop2-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-server.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/htrace-core.jar
> \
>
> --conf spark.app.name="Kafka To Hbase" \
>
> --conf spark.eventLog.dir="hdfs:///user/spark/applicationHistory" \
>
> --conf spark.eventLog.enabled=false \
>
> --conf spark.eventLog.overwrite=true \
>
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>
> --conf spark.streaming.backpressure.enabled=false \
>
> --conf spark.streaming.kafka.maxRatePerPartition=500 \
>
> --driver-class-path /home/colin.williams/kafka-hbase.jar \
>
> --driver-java-options
>
> -Dspark.executor.extraClassPath=/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*
> \
>
> /home/colin.williams/kafka-hbase.jar "FromTable" "ToTable"
> "broker1:9092,broker2:9092"
>
> On Tue, May 3, 2016 at 8:20 PM, Colin Kincaid Williams 
> wrote:
> > Thanks Cody, I can see that the partitions are well distributed...
> > Then I'm in the process of using the direct api.
> >
> > On Tue, May 3, 2016 at 6:51 PM, Cody Koeninger 
> wrote:
> >> 60 partitions in and of itself shouldn't be a big performance issue
> >> (as long as producers are distributing across partitions evenly).
> >>
> >> On Tue, May 3, 2016 at 1:44 PM, Colin Kincaid Williams 
> wrote:
> >>> Thanks again Cody. Regarding the details 66 kafka partitions on 3
> >>> kafka servers, likely 8 core systems with 10 disks each. Maybe the
> >>> issue with the receiver was the large number of partitions. I had
> >>> miscounted the disks and so 11*3*2 is how I decided to partition my
> >>> topic on insertion, ( by my own, unjustified reasoning, on a first
> >>> attempt ) . This worked well enough for me, I put 1.7 billion entries
> >>> into Kafka on a map reduce job in 5 and a half hours.
> >>>

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
I updated my app to Spark 1.5.2 streaming so that it consumes from
Kafka using the direct API and inserts content into an HBase cluster,
as described in this thread. I was away from this project for a while
due to events in my family.

Currently my scheduling delay is high, but the processing time is
stable at around a second. I changed my setup to use 6 Kafka partitions
on a set of smaller Kafka brokers, with fewer disks. I've included
some details below, including the script I use to launch the
application. I'm using a Spark on HBase library, whose version is
relevant to my HBase cluster. Is it apparent that there is something wrong
with my launch method that could be causing the delay, related to the
included jars?

Or is there something wrong with the very simple approach I'm taking
for the application?

Any advice is appreciated.


The application:

https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877


From the streaming UI I get something like:

table Completed Batches (last 1000 out of 27136)


Batch Time Input Size Scheduling Delay (?) Processing Time (?) Total
Delay (?) Output Ops: Succeeded/Total

2016/06/18 11:21:32 3000 events 1.2 h 1 s 1.2 h 1/1

2016/06/18 11:21:31 3000 events 1.2 h 1 s 1.2 h 1/1

2016/06/18 11:21:30 3000 events 1.2 h 1 s 1.2 h 1/1


Here's how I'm launching the spark application.


#!/usr/bin/env bash

export SPARK_CONF_DIR=/home/colin.williams/spark

export HADOOP_CONF_DIR=/etc/hadoop/conf

export 
HADOOP_CLASSPATH=/home/colin.williams/hbase/conf/:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/hbase-protocol-0.98.6-cdh5.3.0.jar


/opt/spark-1.5.2-bin-hadoop2.4/bin/spark-submit \

--class com.example.KafkaToHbase \

--master spark://spark_master:7077 \

--deploy-mode client \

--num-executors 6 \

--driver-memory 4G \

--executor-memory 2G \

--total-executor-cores 12 \

--jars 
/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/zookeeper/zookeeper-3.4.5-cdh5.3.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/guava-12.0.1.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/protobuf-java-2.5.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-protocol.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-client.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-common.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop2-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-server.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/htrace-core.jar
\

--conf spark.app.name="Kafka To Hbase" \

--conf spark.eventLog.dir="hdfs:///user/spark/applicationHistory" \

--conf spark.eventLog.enabled=false \

--conf spark.eventLog.overwrite=true \

--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \

--conf spark.streaming.backpressure.enabled=false \

--conf spark.streaming.kafka.maxRatePerPartition=500 \

--driver-class-path /home/colin.williams/kafka-hbase.jar \

--driver-java-options
-Dspark.executor.extraClassPath=/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*
\

/home/colin.williams/kafka-hbase.jar "FromTable" "ToTable"
"broker1:9092,broker2:9092"

On Tue, May 3, 2016 at 8:20 PM, Colin Kincaid Williams  wrote:
> Thanks Cody, I can see that the partitions are well distributed...
> Then I'm in the process of using the direct api.
>
> On Tue, May 3, 2016 at 6:51 PM, Cody Koeninger  wrote:
>> 60 partitions in and of itself shouldn't be a big performance issue
>> (as long as producers are distributing across partitions evenly).
>>
>> On Tue, May 3, 2016 at 1:44 PM, Colin Kincaid Williams  
>> wrote:
>>> Thanks again Cody. Regarding the details 66 kafka partitions on 3
>>> kafka servers, likely 8 core systems with 10 disks each. Maybe the
>>> issue with the receiver was the large number of partitions. I had
>>> miscounted the disks and so 11*3*2 is how I decided to partition my
>>> topic on insertion, ( by my own, unjustified reasoning, on a first
>>> attempt ) . This worked well enough for me, I put 1.7 billion entries
>>> into Kafka on a map reduce job in 5 and a half hours.
>>>
>>> I was concerned using spark 1.5.2 because I'm currently putting my
>>> data into a CDH 5.3 HDFS cluster, using hbase-spark .98 library jars
>>> built for spark 1.2 on CDH 5.3. But after debugging quite a bit
>>> yesterday, I tried building against 1.5.2. So far it's running without
>>> issue on a Spark 1.5.2 cluster. I'm not sure there was too much
>>> improvement using the same code, but I'll see how the direct api
>>> handles it. In the end I can reduce the number of partitions in Kafka
>>> if it causes big performance issues.
>>>
>>> On Tue, May 3, 2016 at 4:08 AM, Cody Koeninger  wrote:
 

Spark not using all the cluster instances in AWS EMR

2016-06-18 Thread Natu Lauchande
Hi,

I am running some Spark loads. I notice that it only uses one of the
machines (instead of the 3 available) in the cluster.

Is there any parameter that can be set to force it to use the whole cluster?

I am using AWS EMR with YARN.


Thanks,
Natu


Re: unsubscribe error

2016-06-18 Thread Mich Talebzadeh
How do you unsubscribe with the email address that it says is
marco.plata...@yahoo.it.invalid? When the manager program sends you a
confirmation to unsubscribe, it is probably bouncing back!


Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 18 June 2016 at 20:04, Marco Platania 
wrote:

> Dear admin,
>
> I've tried to unsubscribe from this mailing list twice, but I'm still
> receiving emails. Can you please fix this?
>
> Thanks,
> Marco
>


unsubscribe error

2016-06-18 Thread Marco Platania
Dear admin,
I've tried to unsubscribe from this mailing list twice, but I'm still receiving 
emails. Can you please fix this?
Thanks,
Marco

CfP for Spark Summit Brussels, 2016

2016-06-18 Thread Jules Damji
Hello All,
Just in case you missed it, Spark Summit is returning to Europe, October 25-27,
2016, and the Call for Presentations is open.

Submit your CfP before July 1: https://spark-summit.org/eu-2016/

Cheers,

Jules

Community Evangelist 

Databricks, Inc

Sent from my iPhone
Pardon the dumb thumb typos :)



Re: Running JavaBased Implementation of StreamingKmeans

2016-06-18 Thread Biplob Biswas
Hi,

I tried local[*] and local[2] and the result is the same. I don't really
understand the problem here.
How can I confirm that the files are read properly?

Thanks & Regards
Biplob Biswas

On Sat, Jun 18, 2016 at 5:59 PM, Akhil Das  wrote:

> Looks like you need to set your master to local[2] or local[*]
>
> On Sat, Jun 18, 2016 at 4:54 PM, Biplob Biswas 
> wrote:
>
>> Hi,
>>
>> I implemented the streamingKmeans example provided in the spark website
>> but
>> in Java.
>> The full implementation is here,
>>
>> http://pastebin.com/CJQfWNvk
>>
>> But i am not getting anything in the output except occasional timestamps
>> like one below:
>>
>> ---
>> Time: 1466176935000 ms
>> ---
>>
>> Also, i have 2 directories:
>> "D:\spark\streaming example\Data Sets\training"
>> "D:\spark\streaming example\Data Sets\test"
>>
>> and inside these directories i have 1 file each "samplegpsdata_train.txt"
>> and "samplegpsdata_test.txt" with training data having 500 datapoints and
>> test data with 60 datapoints.
>>
>> I am very new to the spark systems and any help is highly appreciated.
>>
>> Thank you so much
>> Biplob Biswas
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Running-JavaBased-Implementationof-StreamingKmeans-tp27192.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
> Cheers!
>
>


Running JavaBased Implementation of StreamingKmeans

2016-06-18 Thread Biplob Biswas
Hi,

I implemented the streamingKmeans example provided on the Spark website, but
in Java.
The full implementation is here:

http://pastebin.com/CJQfWNvk

But I am not getting anything in the output except occasional timestamps
like the one below:

---
Time: 1466176935000 ms
---

Also, I have 2 directories:
"D:\spark\streaming example\Data Sets\training"
"D:\spark\streaming example\Data Sets\test"

and inside these directories I have 1 file each, "samplegpsdata_train.txt"
and "samplegpsdata_test.txt", with the training data having 500 data points
and the test data 60 data points.

I am very new to Spark and any help is highly appreciated.

Thanks & Regards
Biplob Biswas


Re: What does it mean when an executor has negative active tasks?

2016-06-18 Thread Brandon White
1.6
On Jun 18, 2016 10:02 AM, "Mich Talebzadeh" 
wrote:

> could be a bug as they are no failed jobs. what version of Spark is this?
>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 18 June 2016 at 17:50, Brandon White  wrote:
>
>> What does it mean when a executor has negative active tasks?
>>
>>
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
>


Re: What does it mean when an executor has negative active tasks?

2016-06-18 Thread Mich Talebzadeh
Could be a bug, as there are no failed jobs. What version of Spark is this?


HTH

Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 18 June 2016 at 17:50, Brandon White  wrote:

> What does it mean when a executor has negative active tasks?
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>


Re: Many executors with the same ID in web UI (under Executors)?

2016-06-18 Thread Jacek Laskowski
Hi Mich,

That's correct -- they're indeed duplicates in the table but not at the OS
level. The reason for this *might* be that you need to have separate
stdout and stderr for the failed execution(s). I'm using
--num-executors 2 and there are two executor backends.

$ jps -l
28865 sun.tools.jps.Jps
802 com.typesafe.zinc.Nailgun
28276 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
28804 org.apache.spark.executor.CoarseGrainedExecutorBackend
15450
28378 org.apache.hadoop.yarn.server.nodemanager.NodeManager
28778 org.apache.spark.executor.CoarseGrainedExecutorBackend
28748 org.apache.spark.deploy.yarn.ExecutorLauncher
28463 org.apache.spark.deploy.SparkSubmit

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sat, Jun 18, 2016 at 6:16 PM, Mich Talebzadeh
 wrote:
> Can you please run jps on 1-node host and send the output. All those
> executor IDs some are just duplicates!
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
> On 18 June 2016 at 17:08, Jacek Laskowski  wrote:
>>
>> Hi,
>>
>> Thanks Mich and Akhil for such prompt responses! Here's the screenshot
>> [1] which is a part of
>> https://issues.apache.org/jira/browse/SPARK-16047 I reported today (to
>> have the executors sorted by status and id).
>>
>> [1]
>> https://issues.apache.org/jira/secure/attachment/12811665/spark-webui-executors.png
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Sat, Jun 18, 2016 at 6:05 PM, Akhil Das  wrote:
>> > A screenshot of the executor tab will explain it better. Usually
>> > executors
>> > are allocated when the job is started, if you have a multi-node cluster
>> > then
>> > you'll see executors launched on different nodes.
>> >
>> > On Sat, Jun 18, 2016 at 9:04 PM, Jacek Laskowski 
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> This is for Spark on YARN - a 1-node cluster with Spark 2.0.0-SNAPSHOT
>> >> (today build)
>> >>
>> >> I can understand that when a stage fails a new executor entry shows up
>> >> in web UI under Executors tab (that corresponds to a stage attempt). I
>> >> understand that this is to keep the stdout and stderr logs for future
>> >> reference.
>> >>
>> >> Why are there multiple executor entries under the same executor IDs?
>> >> What are the executor entries exactly? When are the new ones created
>> >> (after a Spark application is launched and assigned the
>> >> --num-executors executors)?
>> >>
>> >> Pozdrawiam,
>> >> Jacek Laskowski
>> >> 
>> >> https://medium.com/@jaceklaskowski/
>> >> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> >> Follow me at https://twitter.com/jaceklaskowski
>> >>
>> >> -
>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: user-h...@spark.apache.org
>> >>
>> >
>> >
>> >
>> > --
>> > Cheers!
>> >
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Many executors with the same ID in web UI (under Executors)?

2016-06-18 Thread Mich Talebzadeh
Can you please run jps on the 1-node host and send the output? Some of those
executor IDs are just duplicates!

HTH

Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 18 June 2016 at 17:08, Jacek Laskowski  wrote:

> Hi,
>
> Thanks Mich and Akhil for such prompt responses! Here's the screenshot
> [1] which is a part of
> https://issues.apache.org/jira/browse/SPARK-16047 I reported today (to
> have the executors sorted by status and id).
>
> [1]
> https://issues.apache.org/jira/secure/attachment/12811665/spark-webui-executors.png
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sat, Jun 18, 2016 at 6:05 PM, Akhil Das  wrote:
> > A screenshot of the executor tab will explain it better. Usually
> executors
> > are allocated when the job is started, if you have a multi-node cluster
> then
> > you'll see executors launched on different nodes.
> >
> > On Sat, Jun 18, 2016 at 9:04 PM, Jacek Laskowski 
> wrote:
> >>
> >> Hi,
> >>
> >> This is for Spark on YARN - a 1-node cluster with Spark 2.0.0-SNAPSHOT
> >> (today build)
> >>
> >> I can understand that when a stage fails a new executor entry shows up
> >> in web UI under Executors tab (that corresponds to a stage attempt). I
> >> understand that this is to keep the stdout and stderr logs for future
> >> reference.
> >>
> >> Why are there multiple executor entries under the same executor IDs?
> >> What are the executor entries exactly? When are the new ones created
> >> (after a Spark application is launched and assigned the
> >> --num-executors executors)?
> >>
> >> Pozdrawiam,
> >> Jacek Laskowski
> >> 
> >> https://medium.com/@jaceklaskowski/
> >> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> >> Follow me at https://twitter.com/jaceklaskowski
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
> >
> >
> > --
> > Cheers!
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to enable core dump in spark

2016-06-18 Thread Jacek Laskowski
What about the user of NodeManagers?

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Thu, Jun 16, 2016 at 10:51 PM, prateek arora
 wrote:
> hi
>
> I am using spark with yarn .  how can i  make sure that the ulimit settings
> are applied to the Spark process ?
>
> I set core dump limit to unlimited in all nodes .
>Edit  /etc/security/limits.conf file and add  " * soft core unlimited "
> line.
>
> i rechecked  using :  $ ulimit -all
>
> core file size  (blocks, -c) unlimited
> data seg size   (kbytes, -d) unlimited
> scheduling priority (-e) 0
> file size   (blocks, -f) unlimited
> pending signals (-i) 241204
> max locked memory   (kbytes, -l) 64
> max memory size (kbytes, -m) unlimited
> open files  (-n) 1024
> pipe size(512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> real-time priority  (-r) 0
> stack size  (kbytes, -s) 8192
> cpu time   (seconds, -t) unlimited
> max user processes  (-u) 241204
> virtual memory  (kbytes, -v) unlimited
> file locks  (-x) unlimited
>
> Regards
> Prateek
>
>
> On Thu, Jun 16, 2016 at 4:46 AM, Jacek Laskowski  wrote:
>>
>> Hi,
>>
>> Can you make sure that the ulimit settings are applied to the Spark
>> process? Is this Spark on YARN or Standalone?
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Wed, Jun 1, 2016 at 7:55 PM, prateek arora
>>  wrote:
>> > Hi
>> >
>> > I am using cloudera to  setup spark 1.6.0  on ubuntu 14.04 .
>> >
>> > I set core dump limit to unlimited in all nodes .
>> >Edit  /etc/security/limits.conf file and add  " * soft core unlimited
>> > "
>> > line.
>> >
>> > i rechecked  using :  $ ulimit -all
>> >
>> > core file size  (blocks, -c) unlimited
>> > data seg size   (kbytes, -d) unlimited
>> > scheduling priority (-e) 0
>> > file size   (blocks, -f) unlimited
>> > pending signals (-i) 241204
>> > max locked memory   (kbytes, -l) 64
>> > max memory size (kbytes, -m) unlimited
>> > open files  (-n) 1024
>> > pipe size(512 bytes, -p) 8
>> > POSIX message queues (bytes, -q) 819200
>> > real-time priority  (-r) 0
>> > stack size  (kbytes, -s) 8192
>> > cpu time   (seconds, -t) unlimited
>> > max user processes  (-u) 241204
>> > virtual memory  (kbytes, -v) unlimited
>> > file locks  (-x) unlimited
>> >
>> > but when I am running my spark application with some third party native
>> > libraries . but it crashes some time and show error " Failed to write
>> > core
>> > dump. Core dumps have been disabled. To enable core dumping, try "ulimit
>> > -c
>> > unlimited" before starting Java again " .
>> >
>> > Below are the log :
>> >
>> >  A fatal error has been detected by the Java Runtime Environment:
>> > #
>> > #  SIGSEGV (0xb) at pc=0x7fd44b491fb9, pid=20458,
>> > tid=140549318547200
>> > #
>> > # JRE version: Java(TM) SE Runtime Environment (7.0_67-b01) (build
>> > 1.7.0_67-b01)
>> > # Java VM: Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode
>> > linux-amd64 compressed oops)
>> > # Problematic frame:
>> > # V  [libjvm.so+0x650fb9]  jni_SetByteArrayRegion+0xa9
>> > #
>> > # Failed to write core dump. Core dumps have been disabled. To enable
>> > core
>> > dumping, try "ulimit -c unlimited" before starting Java again
>> > #
>> > # An error report file with more information is saved as:
>> > #
>> >
>> > /yarn/nm/usercache/master/appcache/application_1462930975871_0004/container_1462930975871_0004_01_66/hs_err_pid20458.log
>> > #
>> > # If you would like to submit a bug report, please visit:
>> > #   http://bugreport.sun.com/bugreport/crash.jsp
>> > #
>> >
>> >
>> > so how can i enable core dump and save it some place ?
>> >
>> > Regards
>> > Prateek
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/How-to-enable-core-dump-in-spark-tp27065.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Many executors with the same ID in web UI (under Executors)?

2016-06-18 Thread Jacek Laskowski
Hi,

Thanks Mich and Akhil for such prompt responses! Here's the screenshot
[1] which is a part of
https://issues.apache.org/jira/browse/SPARK-16047 I reported today (to
have the executors sorted by status and id).

[1] 
https://issues.apache.org/jira/secure/attachment/12811665/spark-webui-executors.png

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sat, Jun 18, 2016 at 6:05 PM, Akhil Das  wrote:
> A screenshot of the executor tab will explain it better. Usually executors
> are allocated when the job is started, if you have a multi-node cluster then
> you'll see executors launched on different nodes.
>
> On Sat, Jun 18, 2016 at 9:04 PM, Jacek Laskowski  wrote:
>>
>> Hi,
>>
>> This is for Spark on YARN - a 1-node cluster with Spark 2.0.0-SNAPSHOT
>> (today build)
>>
>> I can understand that when a stage fails a new executor entry shows up
>> in web UI under Executors tab (that corresponds to a stage attempt). I
>> understand that this is to keep the stdout and stderr logs for future
>> reference.
>>
>> Why are there multiple executor entries under the same executor IDs?
>> What are the executor entries exactly? When are the new ones created
>> (after a Spark application is launched and assigned the
>> --num-executors executors)?
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
>
>
> --
> Cheers!
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Can I control the execution of Spark jobs?

2016-06-18 Thread Jacek Laskowski
Hi,

Ahh, that makes sense now.

Spark works like this by default. You just run your first pipeline and
then another one (and perhaps some more). Since the pipelines are
processed serially (one by one), you implicitly create a dependency
between the Spark jobs. No special steps are needed.

pipeline == load a dataset, transform it and save it to persistent storage
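For example, a minimal sketch of two dependent pipelines in one application
(the paths and column names below are made up, and spark.implicits._ is
assumed to be in scope, as it is in spark-shell):

// pipeline 1: load raw events, transform them, save the result
val raw = spark.read.json("/data/raw/events")
val cleaned = raw.filter($"status" === "ok")
cleaned.write.mode("overwrite").parquet("/data/cleaned/events")

// pipeline 2: only starts after pipeline 1 has finished, because the
// actions above run serially on the driver
val events = spark.read.parquet("/data/cleaned/events")
events.groupBy($"userId").count()
  .write.mode("overwrite").parquet("/data/reports/counts")

Since saving is an action, the second read cannot start before the first
write has completed.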

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Fri, Jun 17, 2016 at 4:15 AM, Haopu Wang  wrote:
> Jacek,
>
> For example, one ETL job is saving raw events and updating a file.
> The other job is using that file's content to process the data set.
>
> In this case, the first job has to be done before the second one. That's what 
> I mean by dependency. Any suggestions/comments are appreciated.
>
> -Original Message-
> From: Jacek Laskowski [mailto:ja...@japila.pl]
> Sent: 16 June 2016 19:09
> To: user
> Subject: Re: Can I control the execution of Spark jobs?
>
> Hi,
>
> When you say "several ETL types of things", what is this exactly? What
> would an example of "dependency between these jobs" be?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Thu, Jun 16, 2016 at 11:36 AM, Haopu Wang  wrote:
>> Hi,
>>
>>
>>
>> Suppose I have a spark application which is doing several ETL types of
>> things.
>>
>> I understand Spark can analyze and generate several jobs to execute.
>>
>> The question is: is it possible to control the dependency between these
>> jobs?
>>
>>
>>
>> Thanks!
>>
>>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Many executors with the same ID in web UI (under Executors)?

2016-06-18 Thread Akhil Das
A screenshot of the Executors tab will explain it better. Usually executors
are allocated when the job is started; if you have a multi-node cluster,
you'll see executors launched on different nodes.

On Sat, Jun 18, 2016 at 9:04 PM, Jacek Laskowski  wrote:

> Hi,
>
> This is for Spark on YARN - a 1-node cluster with Spark 2.0.0-SNAPSHOT
> (today build)
>
> I can understand that when a stage fails a new executor entry shows up
> in web UI under Executors tab (that corresponds to a stage attempt). I
> understand that this is to keep the stdout and stderr logs for future
> reference.
>
> Why are there multiple executor entries under the same executor IDs?
> What are the executor entries exactly? When are the new ones created
> (after a Spark application is launched and assigned the
> --num-executors executors)?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Cheers!


Re: Many executors with the same ID in web UI (under Executors)?

2016-06-18 Thread Mich Talebzadeh
Hi Jacek,

Can you take a snapshot of your GUI /executors and GUI /Environment pages?

On a single-node cluster, is the executor ID the driver?

But we can find it all out from the Environment snapshot (snipping tool).

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 18 June 2016 at 15:04, Jacek Laskowski  wrote:

> Hi,
>
> This is for Spark on YARN - a 1-node cluster with Spark 2.0.0-SNAPSHOT
> (today build)
>
> I can understand that when a stage fails a new executor entry shows up
> in web UI under Executors tab (that corresponds to a stage attempt). I
> understand that this is to keep the stdout and stderr logs for future
> reference.
>
> Why are there multiple executor entries under the same executor IDs?
> What are the executor entries exactly? When are the new ones created
> (after a Spark application is launched and assigned the
> --num-executors executors)?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Running JavaBased Implementationof StreamingKmeans

2016-06-18 Thread Akhil Das
Looks like you need to set your master to local[2] or local[*]
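In Scala that boils down to something like this when the context is created
(a sketch; the app name and batch interval are placeholders, and the Java
SparkConf/JavaStreamingContext accept the same setMaster call):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("StreamingKMeansExample")   // placeholder
  .setMaster("local[2]")   // at least two threads so the streaming batches are not starved
val ssc = new StreamingContext(conf, Seconds(10))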

On Sat, Jun 18, 2016 at 4:54 PM, Biplob Biswas 
wrote:

> Hi,
>
> I implemented the streamingKmeans example provided in the spark website but
> in Java.
> The full implementation is here,
>
> http://pastebin.com/CJQfWNvk
>
> But i am not getting anything in the output except occasional timestamps
> like one below:
>
> ---
> Time: 1466176935000 ms
> ---
>
> Also, i have 2 directories:
> "D:\spark\streaming example\Data Sets\training"
> "D:\spark\streaming example\Data Sets\test"
>
> and inside these directories i have 1 file each "samplegpsdata_train.txt"
> and "samplegpsdata_test.txt" with training data having 500 datapoints and
> test data with 60 datapoints.
>
> I am very new to the spark systems and any help is highly appreciated.
>
> Thank you so much
> Biplob Biswas
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Running-JavaBased-Implementationof-StreamingKmeans-tp27192.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Cheers!


Re: Dataset Select Function after Aggregate Error

2016-06-18 Thread Ted Yu
scala> ds.groupBy($"_1").count.select(expr("_1").as[String],
expr("count").as[Long])
res0: org.apache.spark.sql.Dataset[(String, Long)] = [_1: int, count:
bigint]

scala> ds.groupBy($"_1").count.select(expr("_1").as[String],
expr("count").as[Long]).show
+---+-+
| _1|count|
+---+-+
|  1|1|
|  2|1|
+---+-+
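For completeness, outside spark-shell the snippet above also needs the usual
imports (a sketch; 'spark' is the SparkSession):

import org.apache.spark.sql.functions.expr
import spark.implicits._   // provides the encoders behind .as[String] / .as[Long]

spark-shell brings both into scope automatically.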

On Sat, Jun 18, 2016 at 8:29 AM, Pedro Rodriguez 
wrote:

> I am curious if there is a way to call this so that it becomes a compile
> error rather than runtime error:
>
> // Note misspelled count and name
> ds.groupBy($"name").count.select('nam, $"coun").show
>
> More specifically, what are the best type safety guarantees that Datasets
> provide? It seems like with Dataframes there is still the unsafety of
> specifying column names by string/symbol and expecting the type to be
> correct and exist, but if you do something like this then downstream code
> is safer:
>
> // This is Array[(String, Long)] instead of Array[sql.Row]
> ds.groupBy($"name").count.select('name.as[String], 'count.as
> [Long]).collect()
>
> Does that seem like a correct understanding of Datasets?
>
> On Sat, Jun 18, 2016 at 6:39 AM, Pedro Rodriguez 
> wrote:
>
>> Looks like it was my own fault. I had spark 2.0 cloned/built, but had the
>> spark shell in my path so somehow 1.6.1 was being used instead of 2.0.
>> Thanks
>>
>> On Sat, Jun 18, 2016 at 1:16 AM, Takeshi Yamamuro 
>> wrote:
>>
>>> which version you use?
>>> I passed in 2.0-preview as follows;
>>> ---
>>>
>>> Spark context available as 'sc' (master = local[*], app id =
>>> local-1466234043659).
>>>
>>> Spark session available as 'spark'.
>>>
>>> Welcome to
>>>
>>>     __
>>>
>>>  / __/__  ___ _/ /__
>>>
>>> _\ \/ _ \/ _ `/ __/  '_/
>>>
>>>/___/ .__/\_,_/_/ /_/\_\   version 2.0.0-preview
>>>
>>>   /_/
>>>
>>>
>>>
>>> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
>>> 1.8.0_31)
>>>
>>> Type in expressions to have them evaluated.
>>>
>>> Type :help for more information.
>>>
>>>
>>> scala> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
>>>
>>> hive.metastore.schema.verification is not enabled so recording the
>>> schema version 1.2.0
>>>
>>> ds: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]
>>>
>>> scala> ds.groupBy($"_1").count.select($"_1", $"count").show
>>>
>>> +---+-+
>>>
>>> | _1|count|
>>>
>>> +---+-+
>>>
>>> |  1|1|
>>>
>>> |  2|1|
>>>
>>> +---+-+
>>>
>>>
>>>
>>> On Sat, Jun 18, 2016 at 3:09 PM, Pedro Rodriguez <
>>> ski.rodrig...@gmail.com> wrote:
>>>
 I went ahead and downloaded/compiled Spark 2.0 to try your code snippet
 Takeshi. It unfortunately doesn't compile.

 scala> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
 ds: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]

 scala> ds.groupBy($"_1").count.select($"_1", $"count").show
 :28: error: type mismatch;
  found   : org.apache.spark.sql.ColumnName
  required: org.apache.spark.sql.TypedColumn[(org.apache.spark.sql.Row,
 Long),?]
   ds.groupBy($"_1").count.select($"_1", $"count").show
  ^

 I also gave a try to Xinh's suggestion using the code snippet below
 (partially from spark docs)
 scala> val ds = Seq(Person("Andy", 32), Person("Andy", 2),
 Person("Pedro", 24), Person("Bob", 42)).toDS()
 scala> ds.groupBy(_.name).count.select($"name".as[String]).show
 org.apache.spark.sql.AnalysisException: cannot resolve 'name' given
 input columns: [];
 scala> ds.groupBy(_.name).count.select($"_1".as[String]).show
 org.apache.spark.sql.AnalysisException: cannot resolve 'name' given
 input columns: [];
 scala> ds.groupBy($"name").count.select($"_1".as[String]).show
 org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input
 columns: [];

 Looks like there are empty columns for some reason, the code below
 works fine for the simple aggregate
 scala> ds.groupBy(_.name).count.show

 Would be great to see an idiomatic example of using aggregates like
 these mixed with spark.sql.functions.

 Pedro

 On Fri, Jun 17, 2016 at 9:59 PM, Pedro Rodriguez <
 ski.rodrig...@gmail.com> wrote:

> Thanks Xinh and Takeshi,
>
> I am trying to avoid map since my impression is that this uses a Scala
> closure so is not optimized as well as doing column-wise operations is.
>
> Looks like the $ notation is the way to go, thanks for the help. Is
> there an explanation of how this works? I imagine it is a method/function
> with its name defined as $ in Scala?
>
> Lastly, are there prelim Spark 2.0 docs? If there isn't a good
> description/guide of using this syntax I would be willing to contribute
> some documentation.
>
> Pedro
>
> 

Re: Dataset Select Function after Aggregate Error

2016-06-18 Thread Pedro Rodriguez
I am curious if there is a way to call this so that it becomes a compile
error rather than runtime error:

// Note misspelled count and name
ds.groupBy($"name").count.select('nam, $"coun").show

More specifically, what are the best type safety guarantees that Datasets
provide? It seems like with Dataframes there is still the unsafety of
specifying column names by string/symbol and expecting the type to be
correct and exist, but if you do something like this then downstream code
is safer:

// This is Array[(String, Long)] instead of Array[sql.Row]
ds.groupBy($"name").count.select('name.as[String], 'count.as
[Long]).collect()

Does that seem like a correct understanding of Datasets?
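(For contrast, with the typed, lambda-based API a typo is caught at compile
time. A sketch, assuming ds is a Dataset[Person] with a name field:

ds.map(_.nam)   // compile error: value nam is not a member of Person

while the column-name based select only fails at runtime.)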

On Sat, Jun 18, 2016 at 6:39 AM, Pedro Rodriguez 
wrote:

> Looks like it was my own fault. I had spark 2.0 cloned/built, but had the
> spark shell in my path so somehow 1.6.1 was being used instead of 2.0.
> Thanks
>
> On Sat, Jun 18, 2016 at 1:16 AM, Takeshi Yamamuro 
> wrote:
>
>> which version you use?
>> I passed in 2.0-preview as follows;
>> ---
>>
>> Spark context available as 'sc' (master = local[*], app id =
>> local-1466234043659).
>>
>> Spark session available as 'spark'.
>>
>> Welcome to
>>
>>     __
>>
>>  / __/__  ___ _/ /__
>>
>> _\ \/ _ \/ _ `/ __/  '_/
>>
>>/___/ .__/\_,_/_/ /_/\_\   version 2.0.0-preview
>>
>>   /_/
>>
>>
>>
>> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
>> 1.8.0_31)
>>
>> Type in expressions to have them evaluated.
>>
>> Type :help for more information.
>>
>>
>> scala> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
>>
>> hive.metastore.schema.verification is not enabled so recording the schema
>> version 1.2.0
>>
>> ds: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]
>>
>> scala> ds.groupBy($"_1").count.select($"_1", $"count").show
>>
>> +---+-+
>>
>> | _1|count|
>>
>> +---+-+
>>
>> |  1|1|
>>
>> |  2|1|
>>
>> +---+-+
>>
>>
>>
>> On Sat, Jun 18, 2016 at 3:09 PM, Pedro Rodriguez > > wrote:
>>
>>> I went ahead and downloaded/compiled Spark 2.0 to try your code snippet
>>> Takeshi. It unfortunately doesn't compile.
>>>
>>> scala> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
>>> ds: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]
>>>
>>> scala> ds.groupBy($"_1").count.select($"_1", $"count").show
>>> :28: error: type mismatch;
>>>  found   : org.apache.spark.sql.ColumnName
>>>  required: org.apache.spark.sql.TypedColumn[(org.apache.spark.sql.Row,
>>> Long),?]
>>>   ds.groupBy($"_1").count.select($"_1", $"count").show
>>>  ^
>>>
>>> I also gave a try to Xinh's suggestion using the code snippet below
>>> (partially from spark docs)
>>> scala> val ds = Seq(Person("Andy", 32), Person("Andy", 2),
>>> Person("Pedro", 24), Person("Bob", 42)).toDS()
>>> scala> ds.groupBy(_.name).count.select($"name".as[String]).show
>>> org.apache.spark.sql.AnalysisException: cannot resolve 'name' given
>>> input columns: [];
>>> scala> ds.groupBy(_.name).count.select($"_1".as[String]).show
>>> org.apache.spark.sql.AnalysisException: cannot resolve 'name' given
>>> input columns: [];
>>> scala> ds.groupBy($"name").count.select($"_1".as[String]).show
>>> org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input
>>> columns: [];
>>>
>>> Looks like there are empty columns for some reason, the code below works
>>> fine for the simple aggregate
>>> scala> ds.groupBy(_.name).count.show
>>>
>>> Would be great to see an idiomatic example of using aggregates like
>>> these mixed with spark.sql.functions.
>>>
>>> Pedro
>>>
>>> On Fri, Jun 17, 2016 at 9:59 PM, Pedro Rodriguez <
>>> ski.rodrig...@gmail.com> wrote:
>>>
 Thanks Xinh and Takeshi,

 I am trying to avoid map since my impression is that this uses a Scala
 closure so is not optimized as well as doing column-wise operations is.

 Looks like the $ notation is the way to go, thanks for the help. Is
 there an explanation of how this works? I imagine it is a method/function
 with its name defined as $ in Scala?

 Lastly, are there prelim Spark 2.0 docs? If there isn't a good
 description/guide of using this syntax I would be willing to contribute
 some documentation.

 Pedro

 On Fri, Jun 17, 2016 at 8:53 PM, Takeshi Yamamuro <
 linguin@gmail.com> wrote:

> Hi,
>
> In 2.0, you can say;
> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
> ds.groupBy($"_1").count.select($"_1", $"count").show
>
>
> // maropu
>
>
> On Sat, Jun 18, 2016 at 7:53 AM, Xinh Huynh 
> wrote:
>
>> Hi Pedro,
>>
>> In 1.6.1, you can do:
>> >> ds.groupBy(_.uid).count().map(_._1)
>> or
>> >> ds.groupBy(_.uid).count().select($"value".as[String])
>>

Many executors with the same ID in web UI (under Executors)?

2016-06-18 Thread Jacek Laskowski
Hi,

This is for Spark on YARN - a 1-node cluster with Spark 2.0.0-SNAPSHOT
(today build)

I can understand that when a stage fails a new executor entry shows up
in web UI under Executors tab (that corresponds to a stage attempt). I
understand that this is to keep the stdout and stderr logs for future
reference.

Why are there multiple executor entries under the same executor IDs?
What are the executor entries exactly? When are the new ones created
(after a Spark application is launched and assigned the
--num-executors executors)?

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Python to Scala

2016-06-18 Thread Sivakumaran S
If you can identify a suitable Java example in the Spark directory, you can use
that as a template and convert it to Scala code using http://javatoscala.com/


Siva

> On 18-Jun-2016, at 6:27 AM, Aakash Basu  wrote:
> 
> I don't have a sound knowledge in Python and on the other hand we are working 
> on Spark on Scala, so I don't think it will be allowed to run PySpark along 
> with it, so the requirement is to convert the code to scala and use it. But 
> I'm finding it difficult.
> 
> Did not find a better forum for help than ours. Hence this mail.
> 
> On 18-Jun-2016 10:39 AM, "Stephen Boesch"  > wrote:
> What are you expecting us to do?  Yash provided a reasonable approach - based 
> on the info you had provided in prior emails.  Otherwise you can convert it 
> from Python to Scala - or find someone else who feels comfortable to do it.
> That kind of inquiry would likely be appropriate on a job board.
> 
> 
> 
> 2016-06-17 21:47 GMT-07:00 Aakash Basu  >:
> Hey,
> 
> Our complete project is in Spark on Scala, I code in Scala for Spark, though 
> am new, but I know it and still learning. But I need help in converting this 
> code to Scala. I've nearly no knowledge in Python, hence, requested the 
> experts here.
> 
> Hope you get me now.
> 
> Thanks,
> Aakash.
> 
> On 18-Jun-2016 10:07 AM, "Yash Sharma"  > wrote:
> You could use pyspark to run the python code on spark directly. That will cut 
> the effort of learning scala.
> 
> https://spark.apache.org/docs/0.9.0/python-programming-guide.html 
> 
> - Thanks, via mobile,  excuse brevity.
> 
> On Jun 18, 2016 2:34 PM, "Aakash Basu"  > wrote:
> Hi all,
> 
> I've a python code, which I want to convert to Scala for using it in a Spark 
> program. I'm not so well acquainted with python and learning scala now. Any 
> Python+Scala expert here? Can someone help me out in this please?
> 
> Thanks & Regards,
> Aakash.
> 
> 



Re: Python to Scala

2016-06-18 Thread Marco Mistroni
Hi
Post the code. I code in Python and Scala on Spark... I can give you help,
though the APIs for Scala and Python are practically the same... the only
difference is in the Python lambdas vs Scala inline functions.
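For example, the same map in both (just a sketch, for any RDD of ints):

// PySpark:  rdd.map(lambda x: x * 2)
// Scala:
rdd.map(x => x * 2)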
Hth
On 18 Jun 2016 6:27 am, "Aakash Basu"  wrote:

> I don't have a sound knowledge in Python and on the other hand we are
> working on Spark on Scala, so I don't think it will be allowed to run
> PySpark along with it, so the requirement is to convert the code to scala
> and use it. But I'm finding it difficult.
>
> Did not find a better forum for help than ours. Hence this mail.
> On 18-Jun-2016 10:39 AM, "Stephen Boesch"  wrote:
>
>> What are you expecting us to do?  Yash provided a reasonable approach -
>> based on the info you had provided in prior emails.  Otherwise you can
>> convert it from Python to Scala - or find someone else who feels
>> comfortable to do it.  That kind of inquiry would likely be appropriate on a
>> job board.
>>
>>
>>
>> 2016-06-17 21:47 GMT-07:00 Aakash Basu :
>>
>>> Hey,
>>>
>>> Our complete project is in Spark on Scala, I code in Scala for Spark,
>>> though am new, but I know it and still learning. But I need help in
>>> converting this code to Scala. I've nearly no knowledge in Python, hence,
>>> requested the experts here.
>>>
>>> Hope you get me now.
>>>
>>> Thanks,
>>> Aakash.
>>> On 18-Jun-2016 10:07 AM, "Yash Sharma"  wrote:
>>>
 You could use pyspark to run the python code on spark directly. That
 will cut the effort of learning scala.

 https://spark.apache.org/docs/0.9.0/python-programming-guide.html

 - Thanks, via mobile,  excuse brevity.
 On Jun 18, 2016 2:34 PM, "Aakash Basu"  wrote:

> Hi all,
>
> I've a python code, which I want to convert to Scala for using it in a
> Spark program. I'm not so well acquainted with python and learning scala
> now. Any Python+Scala expert here? Can someone help me out in this please?
>
> Thanks & Regards,
> Aakash.
>

>>


Re: Dataset Select Function after Aggregate Error

2016-06-18 Thread Pedro Rodriguez
Looks like it was my own fault. I had spark 2.0 cloned/built, but had the
spark shell in my path so somehow 1.6.1 was being used instead of 2.0.
Thanks

On Sat, Jun 18, 2016 at 1:16 AM, Takeshi Yamamuro 
wrote:

> which version you use?
> I passed in 2.0-preview as follows;
> ---
>
> Spark context available as 'sc' (master = local[*], app id =
> local-1466234043659).
>
> Spark session available as 'spark'.
>
> Welcome to
>
>     __
>
>  / __/__  ___ _/ /__
>
> _\ \/ _ \/ _ `/ __/  '_/
>
>/___/ .__/\_,_/_/ /_/\_\   version 2.0.0-preview
>
>   /_/
>
>
>
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
> 1.8.0_31)
>
> Type in expressions to have them evaluated.
>
> Type :help for more information.
>
>
> scala> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
>
> hive.metastore.schema.verification is not enabled so recording the schema
> version 1.2.0
>
> ds: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]
>
> scala> ds.groupBy($"_1").count.select($"_1", $"count").show
>
> +---+-+
>
> | _1|count|
>
> +---+-+
>
> |  1|1|
>
> |  2|1|
>
> +---+-+
>
>
>
> On Sat, Jun 18, 2016 at 3:09 PM, Pedro Rodriguez 
> wrote:
>
>> I went ahead and downloaded/compiled Spark 2.0 to try your code snippet
>> Takeshi. It unfortunately doesn't compile.
>>
>> scala> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
>> ds: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]
>>
>> scala> ds.groupBy($"_1").count.select($"_1", $"count").show
>> :28: error: type mismatch;
>>  found   : org.apache.spark.sql.ColumnName
>>  required: org.apache.spark.sql.TypedColumn[(org.apache.spark.sql.Row,
>> Long),?]
>>   ds.groupBy($"_1").count.select($"_1", $"count").show
>>  ^
>>
>> I also gave a try to Xinh's suggestion using the code snippet below
>> (partially from spark docs)
>> scala> val ds = Seq(Person("Andy", 32), Person("Andy", 2),
>> Person("Pedro", 24), Person("Bob", 42)).toDS()
>> scala> ds.groupBy(_.name).count.select($"name".as[String]).show
>> org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input
>> columns: [];
>> scala> ds.groupBy(_.name).count.select($"_1".as[String]).show
>> org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input
>> columns: [];
>> scala> ds.groupBy($"name").count.select($"_1".as[String]).show
>> org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input
>> columns: [];
>>
>> Looks like there are empty columns for some reason, the code below works
>> fine for the simple aggregate
>> scala> ds.groupBy(_.name).count.show
>>
>> Would be great to see an idiomatic example of using aggregates like these
>> mixed with spark.sql.functions.
>>
>> Pedro
>>
>> On Fri, Jun 17, 2016 at 9:59 PM, Pedro Rodriguez > > wrote:
>>
>>> Thanks Xinh and Takeshi,
>>>
>>> I am trying to avoid map since my impression is that this uses a Scala
>>> closure so is not optimized as well as doing column-wise operations is.
>>>
>>> Looks like the $ notation is the way to go, thanks for the help. Is
>>> there an explanation of how this works? I imagine it is a method/function
>>> with its name defined as $ in Scala?
>>>
>>> Lastly, are there prelim Spark 2.0 docs? If there isn't a good
>>> description/guide of using this syntax I would be willing to contribute
>>> some documentation.
>>>
>>> Pedro
>>>
>>> On Fri, Jun 17, 2016 at 8:53 PM, Takeshi Yamamuro >> > wrote:
>>>
 Hi,

 In 2.0, you can say;
 val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
 ds.groupBy($"_1").count.select($"_1", $"count").show


 // maropu


 On Sat, Jun 18, 2016 at 7:53 AM, Xinh Huynh 
 wrote:

> Hi Pedro,
>
> In 1.6.1, you can do:
> >> ds.groupBy(_.uid).count().map(_._1)
> or
> >> ds.groupBy(_.uid).count().select($"value".as[String])
>
> It doesn't have the exact same syntax as for DataFrame.
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
>
> It might be different in 2.0.
>
> Xinh
>
> On Fri, Jun 17, 2016 at 3:33 PM, Pedro Rodriguez <
> ski.rodrig...@gmail.com> wrote:
>
>> Hi All,
>>
>> I am working on using Datasets in 1.6.1 and eventually 2.0 when its
>> released.
>>
>> I am running the aggregate code below where I have a dataset where
>> the row has a field uid:
>>
>> ds.groupBy(_.uid).count()
>> // res0: org.apache.spark.sql.Dataset[(String, Long)] = [_1: string,
>> _2: bigint]
>>
>> This works as expected, however, attempts to run select statements
>> after fails:
>> ds.groupBy(_.uid).count().select(_._1)
>> // error: missing parameter type for expanded function ((x$2) =>
>> x$2._1)
>> 

Re: How to cause a stage to fail (using spark-shell)?

2016-06-18 Thread Jacek Laskowski
Hi,

Following up on this question, is a stage considered failed only when
there is a FetchFailed exception? Can I have a failed stage with only
a single-stage job?

Appreciate any help on this...(as my family doesn't like me spending
the weekend with Spark :))
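The simplest thing I can think of so far is to throw from inside a task,
e.g. (a sketch; the exception and dataset are arbitrary):

sc.parallelize(1 to 10).map { i =>
  if (i == 5) throw new IllegalStateException("boom")   // fails the task
  i
}.count()

Once the task retries are exhausted the stage, and hence the job, is marked
as failed, but I'm not sure that is what the UI counts as a failed stage.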

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sat, Jun 18, 2016 at 11:53 AM, Jacek Laskowski  wrote:
> Hi,
>
> I'm trying to see some stats about failing stages in web UI and want
> to "create" few failed stages. Is this possible using spark-shell at
> all? Which setup of Spark/spark-shell would allow for such a scenario.
>
> I could write a Scala code if that's the only way to have failing stages.
>
> Please guide. Thanks.
>
> /me on to reviewing the Spark code...
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Making spark read from sources other than HDFS

2016-06-18 Thread Mich Talebzadeh
Spark is capable of reading data from a variety of sources, including
ordinary non-HDFS RDBMS databases.

This will require a JDBC connection to that source, which is obviously not
HDFS.

Which sort of storage do you have in mind? Can you access it via JDBC, ODBC
etc.?
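If it is reachable over JDBC, reading it looks roughly like this (a sketch;
the URL, table and credentials are placeholders, and the matching JDBC driver
jar has to be on the classpath):

val df = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/mydb")   // placeholder
  .option("dbtable", "scratchpad.dummy")                   // placeholder
  .option("user", "scratchpad")
  .option("password", "xxxx")
  .load()

For a completely custom storage layer you would either expose it through a
Hadoop InputFormat or implement a Spark SQL data source for it.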

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 18 June 2016 at 10:52, Ramprakash Ramamoorthy  wrote:

> Hi team,
>
> I'm running spark in cluster mode.
>
> We have a custom file storage in our organisation. Can I plug in data from
> these custom sources (Non HDFS like...)
>
> Can you please shed some light in this aspect, like where do I start,
> should I have to tweak the spark source code (Where exactly do I look out
> for?)
>
> Thank you.
>
> --
> With Thanks and Regards,
> Ramprakash Ramamoorthy,
> Chennai, India
>


How to cause a stage to fail (using spark-shell)?

2016-06-18 Thread Jacek Laskowski
Hi,

I'm trying to see some stats about failing stages in the web UI and want
to "create" a few failed stages. Is this possible using spark-shell at
all? Which setup of Spark/spark-shell would allow for such a scenario?

I could write a Scala code if that's the only way to have failing stages.

Please guide. Thanks.

/me on to reviewing the Spark code...

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Making spark read from sources other than HDFS

2016-06-18 Thread Ramprakash Ramamoorthy
Hi team,

I'm running spark in cluster mode.

We have a custom file storage system in our organisation. Can I plug in data
from these custom, non-HDFS-like sources?

Can you please shed some light on this: where do I start, and do I have to
tweak the Spark source code (where exactly should I look)?

Thank you.

-- 
With Thanks and Regards,
Ramprakash Ramamoorthy,
Chennai, India


Re: Python to Scala

2016-06-18 Thread ayan guha
Post the code... someone would be able to help (yours truly included).

On Sat, Jun 18, 2016 at 4:13 PM, Yash Sharma  wrote:

> Couple of things that can work-
> - If you know the logic- just forget the python script and write it in
> java/scala from scratch
> - If you have python functions and libraries used- Pyspark is probably the
> best bet.
> - If you have specific questions on how to solve a particular
> implementation issue you are facing- ask it here :)
>
> Apart from that its really difficult to understand what you are asking.
>
> - Thanks, via mobile,  excuse brevity.
> On Jun 18, 2016 3:27 PM, "Aakash Basu"  wrote:
>
>> I don't have a sound knowledge in Python and on the other hand we are
>> working on Spark on Scala, so I don't think it will be allowed to run
>> PySpark along with it, so the requirement is to convert the code to scala
>> and use it. But I'm finding it difficult.
>>
>> Did not find a better forum for help than ours. Hence this mail.
>> On 18-Jun-2016 10:39 AM, "Stephen Boesch"  wrote:
>>
>>> What are you expecting us to do?  Yash provided a reasonable approach -
>>> based on the info you had provided in prior emails.  Otherwise you can
>>> convert it from Python to Scala - or find someone else who feels
>>> comfortable to do it.  That kind of inquiry would likely be appropriate on a
>>> job board.
>>>
>>>
>>>
>>> 2016-06-17 21:47 GMT-07:00 Aakash Basu :
>>>
 Hey,

 Our complete project is in Spark on Scala, I code in Scala for Spark,
 though am new, but I know it and still learning. But I need help in
 converting this code to Scala. I've nearly no knowledge in Python, hence,
 requested the experts here.

 Hope you get me now.

 Thanks,
 Aakash.
 On 18-Jun-2016 10:07 AM, "Yash Sharma"  wrote:

> You could use pyspark to run the python code on spark directly. That
> will cut the effort of learning scala.
>
> https://spark.apache.org/docs/0.9.0/python-programming-guide.html
>
> - Thanks, via mobile,  excuse brevity.
> On Jun 18, 2016 2:34 PM, "Aakash Basu"  wrote:
>
>> Hi all,
>>
>> I've a python code, which I want to convert to Scala for using it in
>> a Spark program. I'm not so well acquainted with python and learning 
>> scala
>> now. Any Python+Scala expert here? Can someone help me out in this 
>> please?
>>
>> Thanks & Regards,
>> Aakash.
>>
>
>>>


-- 
Best Regards,
Ayan Guha


Re: Dataset Select Function after Aggregate Error

2016-06-18 Thread Takeshi Yamamuro
Which version do you use?
I passed in 2.0-preview as follows;
---

Spark context available as 'sc' (master = local[*], app id =
local-1466234043659).

Spark session available as 'spark'.

Welcome to

      ____              __

     / __/__  ___ _____/ /__

    _\ \/ _ \/ _ `/ __/  '_/

   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-preview

      /_/



Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
1.8.0_31)

Type in expressions to have them evaluated.

Type :help for more information.


scala> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS

hive.metastore.schema.verification is not enabled so recording the schema
version 1.2.0

ds: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]

scala> ds.groupBy($"_1").count.select($"_1", $"count").show

+---+-+

| _1|count|

+---+-+

|  1|1|

|  2|1|

+---+-+



On Sat, Jun 18, 2016 at 3:09 PM, Pedro Rodriguez 
wrote:

> I went ahead and downloaded/compiled Spark 2.0 to try your code snippet
> Takeshi. It unfortunately doesn't compile.
>
> scala> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
> ds: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]
>
> scala> ds.groupBy($"_1").count.select($"_1", $"count").show
> :28: error: type mismatch;
>  found   : org.apache.spark.sql.ColumnName
>  required: org.apache.spark.sql.TypedColumn[(org.apache.spark.sql.Row,
> Long),?]
>   ds.groupBy($"_1").count.select($"_1", $"count").show
>  ^
>
> I also gave a try to Xinh's suggestion using the code snippet below
> (partially from spark docs)
> scala> val ds = Seq(Person("Andy", 32), Person("Andy", 2), Person("Pedro",
> 24), Person("Bob", 42)).toDS()
> scala> ds.groupBy(_.name).count.select($"name".as[String]).show
> org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input
> columns: [];
> scala> ds.groupBy(_.name).count.select($"_1".as[String]).show
> org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input
> columns: [];
> scala> ds.groupBy($"name").count.select($"_1".as[String]).show
> org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input
> columns: [];
>
> Looks like there are empty columns for some reason, the code below works
> fine for the simple aggregate
> scala> ds.groupBy(_.name).count.show
>
> Would be great to see an idiomatic example of using aggregates like these
> mixed with spark.sql.functions.
>
> Pedro
>
> On Fri, Jun 17, 2016 at 9:59 PM, Pedro Rodriguez 
> wrote:
>
>> Thanks Xinh and Takeshi,
>>
>> I am trying to avoid map since my impression is that this uses a Scala
>> closure so is not optimized as well as doing column-wise operations is.
>>
>> Looks like the $ notation is the way to go, thanks for the help. Is there
>> an explanation of how this works? I imagine it is a method/function with
>> its name defined as $ in Scala?
>>
>> Lastly, are there prelim Spark 2.0 docs? If there isn't a good
>> description/guide of using this syntax I would be willing to contribute
>> some documentation.
>>
>> Pedro
>>
>> On Fri, Jun 17, 2016 at 8:53 PM, Takeshi Yamamuro 
>> wrote:
>>
>>> Hi,
>>>
>>> In 2.0, you can say;
>>> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
>>> ds.groupBy($"_1").count.select($"_1", $"count").show
>>>
>>>
>>> // maropu
>>>
>>>
>>> On Sat, Jun 18, 2016 at 7:53 AM, Xinh Huynh 
>>> wrote:
>>>
 Hi Pedro,

 In 1.6.1, you can do:
 >> ds.groupBy(_.uid).count().map(_._1)
 or
 >> ds.groupBy(_.uid).count().select($"value".as[String])

 It doesn't have the exact same syntax as for DataFrame.
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset

 It might be different in 2.0.

 Xinh

 On Fri, Jun 17, 2016 at 3:33 PM, Pedro Rodriguez <
 ski.rodrig...@gmail.com> wrote:

> Hi All,
>
> I am working on using Datasets in 1.6.1 and eventually 2.0 when its
> released.
>
> I am running the aggregate code below where I have a dataset where the
> row has a field uid:
>
> ds.groupBy(_.uid).count()
> // res0: org.apache.spark.sql.Dataset[(String, Long)] = [_1: string,
> _2: bigint]
>
> This works as expected, however, attempts to run select statements
> after fails:
> ds.groupBy(_.uid).count().select(_._1)
> // error: missing parameter type for expanded function ((x$2) =>
> x$2._1)
> ds.groupBy(_.uid).count().select(_._1)
>
> I have tried several variants, but nothing seems to work. Below is the
> equivalent Dataframe code which works as expected:
> df.groupBy("uid").count().select("uid")
>
> Thanks!
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 

Re: Python to Scala

2016-06-18 Thread Yash Sharma
A couple of things that can work:
- If you know the logic, just forget the Python script and write it in
Java/Scala from scratch.
- If you have Python functions and libraries in use, PySpark is probably the
best bet.
- If you have specific questions on how to solve a particular
implementation issue you are facing, ask them here :)

Apart from that, it's really difficult to understand what you are asking.

- Thanks, via mobile,  excuse brevity.
On Jun 18, 2016 3:27 PM, "Aakash Basu"  wrote:

> I don't have a sound knowledge in Python and on the other hand we are
> working on Spark on Scala, so I don't think it will be allowed to run
> PySpark along with it, so the requirement is to convert the code to scala
> and use it. But I'm finding it difficult.
>
> Did not find a better forum for help than ours. Hence this mail.
> On 18-Jun-2016 10:39 AM, "Stephen Boesch"  wrote:
>
>> What are you expecting us to do?  Yash provided a reasonable approach -
>> based on the info you had provided in prior emails.  Otherwise you can
>> convert it from Python to Scala - or find someone else who feels
>> comfortable to do it.  That kind of inquiry would likely be appropriate on a
>> job board.
>>
>>
>>
>> 2016-06-17 21:47 GMT-07:00 Aakash Basu :
>>
>>> Hey,
>>>
>>> Our complete project is in Spark on Scala, I code in Scala for Spark,
>>> though am new, but I know it and still learning. But I need help in
>>> converting this code to Scala. I've nearly no knowledge in Python, hence,
>>> requested the experts here.
>>>
>>> Hope you get me now.
>>>
>>> Thanks,
>>> Aakash.
>>> On 18-Jun-2016 10:07 AM, "Yash Sharma"  wrote:
>>>
 You could use pyspark to run the python code on spark directly. That
 will cut the effort of learning scala.

 https://spark.apache.org/docs/0.9.0/python-programming-guide.html

 - Thanks, via mobile,  excuse brevity.
 On Jun 18, 2016 2:34 PM, "Aakash Basu"  wrote:

> Hi all,
>
> I've a python code, which I want to convert to Scala for using it in a
> Spark program. I'm not so well acquainted with python and learning scala
> now. Any Python+Scala expert here? Can someone help me out in this please?
>
> Thanks & Regards,
> Aakash.
>

>>


Re: Dataset Select Function after Aggregate Error

2016-06-18 Thread Takeshi Yamamuro
'$' is just replaced with 'Column' inside.
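More precisely, $"foo" comes from the StringToColumn implicit in
spark.implicits._ (sqlContext.implicits._ in 1.6) and just builds a Column,
so these two lines are equivalent (a sketch):

import org.apache.spark.sql.functions.col

ds.groupBy($"_1").count.select($"count").show
ds.groupBy(col("_1")).count.select(col("count")).show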

// maropu

On Sat, Jun 18, 2016 at 12:59 PM, Pedro Rodriguez 
wrote:

> Thanks Xinh and Takeshi,
>
> I am trying to avoid map since my impression is that this uses a Scala
> closure so is not optimized as well as doing column-wise operations is.
>
> Looks like the $ notation is the way to go, thanks for the help. Is there
> an explanation of how this works? I imagine it is a method/function with
> its name defined as $ in Scala?
>
> Lastly, are there prelim Spark 2.0 docs? If there isn't a good
> description/guide of using this syntax I would be willing to contribute
> some documentation.
>
> Pedro
>
> On Fri, Jun 17, 2016 at 8:53 PM, Takeshi Yamamuro 
> wrote:
>
>> Hi,
>>
>> In 2.0, you can say;
>> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
>> ds.groupBy($"_1").count.select($"_1", $"count").show
>>
>>
>> // maropu
>>
>>
>> On Sat, Jun 18, 2016 at 7:53 AM, Xinh Huynh  wrote:
>>
>>> Hi Pedro,
>>>
>>> In 1.6.1, you can do:
>>> >> ds.groupBy(_.uid).count().map(_._1)
>>> or
>>> >> ds.groupBy(_.uid).count().select($"value".as[String])
>>>
>>> It doesn't have the exact same syntax as for DataFrame.
>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
>>>
>>> It might be different in 2.0.
>>>
>>> Xinh
>>>
>>> On Fri, Jun 17, 2016 at 3:33 PM, Pedro Rodriguez <
>>> ski.rodrig...@gmail.com> wrote:
>>>
 Hi All,

 I am working on using Datasets in 1.6.1 and eventually 2.0 when its
 released.

 I am running the aggregate code below where I have a dataset where the
 row has a field uid:

 ds.groupBy(_.uid).count()
 // res0: org.apache.spark.sql.Dataset[(String, Long)] = [_1: string,
 _2: bigint]

 This works as expected, however, attempts to run select statements
 after fails:
 ds.groupBy(_.uid).count().select(_._1)
 // error: missing parameter type for expanded function ((x$2) =>
 x$2._1)
 ds.groupBy(_.uid).count().select(_._1)

 I have tried several variants, but nothing seems to work. Below is the
 equivalent Dataframe code which works as expected:
 df.groupBy("uid").count().select("uid")

 Thanks!
 --
 Pedro Rodriguez
 PhD Student in Distributed Machine Learning | CU Boulder
 UC Berkeley AMPLab Alumni

 ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
 Github: github.com/EntilZha | LinkedIn:
 https://www.linkedin.com/in/pedrorodriguezscience


>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>


-- 
---
Takeshi Yamamuro