Custom Session Windowing in Spark using Scala/Python

2023-08-03 Thread Ravi Teja
Hi,

I am new to Spark and looking for help regarding session windowing in
Spark. I want to create session windows on a user activity stream with a
gap duration of `x` minutes and also have a maximum window size of `y`
hours. I cannot let Spark keep aggregating a user's events for days before
submitting them to the next step. For example, I want Spark to submit the
session window if there is no activity for 30 minutes, but I also want
Spark to submit the session window once it hits the 5-hour limit. I'd like
the solution to be based only on event time and not processing time. Any
help with this is greatly appreciated. Thank you.
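
A minimal sketch of the gap-duration part, assuming Spark 3.2+ (which added
session_window to Structured Streaming) and a streaming DataFrame `events`
with columns user_id and event_time; the hard cap on session length is not
something session_window provides, so that part would likely need custom
state handling (for example flatMapGroupsWithState with event-time timeouts):

import org.apache.spark.sql.functions._

// `events` is assumed to be a streaming DataFrame with a user_id column and
// a timestamp column event_time. Sessions close after a 30-minute
// event-time gap; the 5-hour cap is NOT handled here.
val sessions = events
  .withWatermark("event_time", "30 minutes")
  .groupBy(col("user_id"), session_window(col("event_time"), "30 minutes"))
  .count()

val query = sessions.writeStream
  .outputMode("append")   // rows are emitted once the watermark passes a session's end
  .format("console")
  .start()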

-- 
Best,
RaviTeja


Re: Integration testing Framework Spark SQL Scala

2020-11-02 Thread Lars Albertsson
Hi,

Sorry for the very slow reply - I am far behind in my mailing list
subscriptions.

You'll find a few slides covering the topic in this presentation:
https://www.slideshare.net/lallea/test-strategies-for-data-processing-pipelines-67244458

Video here: https://vimeo.com/192429554

Regards,

Lars Albertsson
Data engineering entrepreneur
www.scling.com, www.mapflat.com
https://twitter.com/lalleal
+46 70 7687109

On Tue, Feb 25, 2020 at 7:46 PM Ruijing Li  wrote:
>
> Just wanted to follow up on this. If anyone has any advice, I’d be interested 
> in learning more!
>
> On Thu, Feb 20, 2020 at 6:09 PM Ruijing Li  wrote:
>>
>> Hi all,
>>
>> I’m interested in hearing the community’s thoughts on best practices to do 
>> integration testing for spark sql jobs. We run a lot of our jobs with cloud 
>> infrastructure and hdfs - this makes debugging a challenge for us, 
>> especially with problems that don’t occur from just initializing a 
>> sparksession locally or testing with spark-shell. Ideally, we’d like some 
>> sort of docker container emulating hdfs and spark cluster mode, that you can 
>> run locally.
>>
>> Any test framework, tips, or examples people can share? Thanks!
>> --
>> Cheers,
>> Ruijing Li
>
> --
> Cheers,
> Ruijing Li

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



elasticsearch-hadoop is not compatible with spark 3.0 (scala 2.12)

2020-06-23 Thread murat migdisoglu
Hi,

I'm testing our codebase against the Spark 3.0.0 stack and I realized that
the elasticsearch-hadoop libraries are built against Scala 2.11 and thus do
not work with Spark 3.0.0 (and probably 2.4.2).

Is anybody else facing this issue? How did you solve it?
The PR on the ES library has been open since Nov 2019...

Thank you


Re: Integration testing Framework Spark SQL Scala

2020-02-25 Thread Ruijing Li
Just wanted to follow up on this. If anyone has any advice, I’d be
interested in learning more!

On Thu, Feb 20, 2020 at 6:09 PM Ruijing Li  wrote:

> Hi all,
>
> I’m interested in hearing the community’s thoughts on best practices to do
> integration testing for spark sql jobs. We run a lot of our jobs with cloud
> infrastructure and hdfs - this makes debugging a challenge for us,
> especially with problems that don’t occur from just initializing a
> sparksession locally or testing with spark-shell. Ideally, we’d like some
> sort of docker container emulating hdfs and spark cluster mode, that you
> can run locally.
>
> Any test framework, tips, or examples people can share? Thanks!
> --
> Cheers,
> Ruijing Li
>
-- 
Cheers,
Ruijing Li


Integration testing Framework Spark SQL Scala

2020-02-20 Thread Ruijing Li
Hi all,

I’m interested in hearing the community’s thoughts on best practices to do
integration testing for spark sql jobs. We run a lot of our jobs with cloud
infrastructure and hdfs - this makes debugging a challenge for us,
especially with problems that don’t occur from just initializing a
sparksession locally or testing with spark-shell. Ideally, we’d like some
sort of docker container emulating hdfs and spark cluster mode, that you
can run locally.

Any test framework, tips, or examples people can share? Thanks!
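
For the purely local layer of such tests, a minimal ScalaTest fixture sketch
(assuming ScalaTest 3.x; the Docker-based HDFS and cluster-mode emulation the
question is really about is not covered here, and all names are illustrative):

import org.apache.spark.sql.SparkSession
import org.scalatest.BeforeAndAfterAll
import org.scalatest.funsuite.AnyFunSuite

class MyJobIntegrationSpec extends AnyFunSuite with BeforeAndAfterAll {

  // local[2] keeps the test self-contained and fast; it will not reproduce
  // cluster-mode or HDFS-specific behaviour.
  private lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("integration-test")
    .getOrCreate()

  override def afterAll(): Unit = spark.stop()

  test("job produces one aggregate row per key") {
    import spark.implicits._
    val input  = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
    val result = input.groupBy("key").sum("value").collect()
    assert(result.length == 2)
  }
}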
-- 
Cheers,
Ruijing Li


Spark 2.4 scala 2.12 Regular Expressions Approach

2019-07-15 Thread anbutech
Hi All,

Could you please help me fix the issue below using Spark 2.4 and Scala 2.12?

How do we extract the multiple values in the given file name pattern using a
Spark/Scala regular expression? Please give me some ideas on the approach
below.

object Driver {

  private val filePattern =
    ("xyzabc_source2target_adver_1stvalue_([a-zA-Z0-9]+)_2ndvalue_([a-zA-Z0-9]+)" +
     "_3rdvalue_([a-zA-Z0-9]+)_4thvalue_([a-zA-Z0-9]+)_5thvalue_([a-zA-Z0-9]+)" +
     "_6thvalue_([a-zA-Z0-9]+)_7thvalue_([a-zA-Z0-9]+)").r

How do I get all 7 values captured by "([a-zA-Z0-9]+)" from the regular
expression pattern above using Spark/Scala, and assign them to the
processing method below, i.e. the case class schema fields?

def processing(x:Dataset[someData]){

x.map{
e =>

caseClassSchema(
Field1 = 1stvalue
Field2 = 2ndvalue
Field3 = 3rdvalue
Field4 = 4thvalue
Field5 = 5thvalue
Field6 = 6thvalue
Field7 = 7thvalue
)
}
}
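
A possible sketch of the extraction, assuming the pattern above is compiled
as a Regex and that someData exposes the file name in a field (called
fileName here, a hypothetical name); Scala's regex extractor pulls out all
seven capture groups in a single match:

// Hypothetical case class standing in for caseClassSchema above.
case class CaseClassSchema(field1: String, field2: String, field3: String,
                           field4: String, field5: String, field6: String,
                           field7: String)

def parseFileName(name: String): Option[CaseClassSchema] = name match {
  // the Regex extractor binds one variable per capture group
  case filePattern(v1, v2, v3, v4, v5, v6, v7) =>
    Some(CaseClassSchema(v1, v2, v3, v4, v5, v6, v7))
  case _ => None                      // name did not match the pattern
}

// inside processing(), assuming someData has a fileName field:
// x.flatMap(e => parseFileName(e.fileName))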


Thanks
Anbu




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Best notebook for developing for apache spark using scala on Amazon EMR Cluster

2019-05-01 Thread Jeff Zhang
You can configure zeppelin to store your notes in S3

http://zeppelin.apache.org/docs/0.8.1/setup/storage/storage.html#notebook-storage-in-s3





V0lleyBallJunki3 wrote on Wed, May 1, 2019, at 5:26 AM:

> Hello. I am using Zeppelin on Amazon EMR cluster while developing Apache
> Spark programs in Scala. The problem is that once that cluster is destroyed
> I lose all the notebooks on it. So over a period of time I have a lot of
> notebooks that require to be manually  exported into my local disk and from
> there imported to each new EMR cluster I create. Is there a notebook
> repository or tool that I can use where I can keep all my notebooks in a
> folder and access them even on new emr clusters. I know Jupyter is there
> but
> it doesn't support auto-complete for Scala.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

-- 
Best Regards

Jeff Zhang


Best notebook for developing for apache spark using scala on Amazon EMR Cluster

2019-04-30 Thread V0lleyBallJunki3
Hello. I am using Zeppelin on an Amazon EMR cluster while developing Apache
Spark programs in Scala. The problem is that once that cluster is destroyed,
I lose all the notebooks on it. So over a period of time I have a lot of
notebooks that need to be manually exported to my local disk and then
imported into each new EMR cluster I create. Is there a notebook repository
or tool I can use to keep all my notebooks in a folder and access them even
on new EMR clusters? I know Jupyter exists, but it doesn't support
auto-complete for Scala.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark with Scala : understanding closures or best way to take udf registrations' code out of main and put in utils

2018-08-21 Thread aastha
This is more of a Scala concept question than a Spark one. I have this Spark
initialization code:

object EntryPoint {
   val spark = SparkFactory.createSparkSession(...
   val funcsSingleton = ContextSingleton[CustomFunctions] { new
CustomFunctions(Some(hashConf)) }
   lazy val funcs = funcsSingleton.get
   //this part I want moved to another place since there are many many
UDFs
   spark.udf.register("funcName", udf {funcName _ })
}
The other class, CustomFunctions looks like this 

class CustomFunctions(val hashConfig: Option[HashConfig], spark:
Option[SparkSession] = None) {
 val funcUdf = udf { funcName _ }
 def funcName(colValue: String) = withDefinedOpt(hashConfig) { c =>
 ...}
}
The class above is made Serializable by wrapping it in ContextSingleton,
which is defined like so:

class ContextSingleton[T: ClassTag](constructor: => T) extends AnyRef
with Serializable {
   val uuid = UUID.randomUUID.toString
   @transient private lazy val instance =
ContextSingleton.pool.synchronized {
ContextSingleton.pool.getOrElseUpdate(uuid, constructor)
   }
   def get = instance.asInstanceOf[T]
}
object ContextSingleton {
   private val pool = new TrieMap[String, Any]()
   def apply[T: ClassTag](constructor: => T): ContextSingleton[T] = new
ContextSingleton[T](constructor)
   def poolSize: Int = pool.size
   def poolClear(): Unit = pool.clear()
}

Now to my problem: I don't want to have to explicitly register the UDFs as
done in the EntryPoint app. I create all UDFs as needed in my CustomFunctions
class and then want to dynamically register only the ones I read from the
user-provided config. What would be the best way to achieve this? Also, I
want to register the required UDFs outside the main app, but that throws the
infamous `TaskNotSerializable` exception. Serializing the big CustomFunctions
is not a good idea, hence I wrapped it in ContextSingleton, but my problem of
registering UDFs outside cannot be solved that way. Please suggest the right
approach.
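
A possible sketch of moving the registration out of EntryPoint, assuming the
enabled UDF names come from the user-provided config (enabledUdfs below) and
that CustomFunctions exposes its UDFs as UserDefinedFunction values such as
funcUdf above; the name map and the object itself are illustrative.
Registration happens on the driver only, so nothing here has to be shipped
to executors:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.UserDefinedFunction

object UdfRegistry {

  // One entry per UDF exposed by CustomFunctions (names are hypothetical).
  private def udfByName(funcs: CustomFunctions): Map[String, UserDefinedFunction] =
    Map(
      "funcName" -> funcs.funcUdf
      // , "otherFunc" -> funcs.otherUdf ...
    )

  // Register only the UDFs listed in the config, against the given session.
  def registerFromConfig(spark: SparkSession,
                         funcs: CustomFunctions,
                         enabledUdfs: Seq[String]): Unit =
    udfByName(funcs)
      .filter { case (name, _) => enabledUdfs.contains(name) }
      .foreach { case (name, fn) => spark.udf.register(name, fn) }
}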
 







--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-16 Thread Gourav Sengupta
Hi,

I am not very sure whether Spark DataFrames apply to your use case; if they
do, please give it a try by creating a UDF in Python and checking whether
you can call it from Scala using select and expr.

Regards,
Gourav Sengupta

On Mon, Jul 16, 2018 at 5:32 AM, Chetan Khatri 
wrote:

> Hello Jayant,
>
> Thanks for great OSS Contribution :)
>
> On Thu, Jul 12, 2018 at 1:36 PM, Jayant Shekhar 
> wrote:
>
>> Hello Chetan,
>>
>> Sorry missed replying earlier. You can find some sample code here :
>>
>> http://sparkflows.readthedocs.io/en/latest/user-guide/python
>> /pipe-python.html
>>
>> We will continue adding more there.
>>
>> Feel free to ping me directly in case of questions.
>>
>> Thanks,
>> Jayant
>>
>>
>> On Mon, Jul 9, 2018 at 9:56 PM, Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Hello Jayant,
>>>
>>> Thank you so much for suggestion. My view was to  use Python function as
>>> transformation which can take couple of column names and return object.
>>> which you explained. would that possible to point me to similiar codebase
>>> example.
>>>
>>> Thanks.
>>>
>>> On Fri, Jul 6, 2018 at 2:56 AM, Jayant Shekhar 
>>> wrote:
>>>
>>>> Hello Chetan,
>>>>
>>>> We have currently done it with .pipe(.py) as Prem suggested.
>>>>
>>>> That passes the RDD as CSV strings to the python script. The python
>>>> script can either process it line by line, create the result and return it
>>>> back. Or create things like Pandas Dataframe for processing and finally
>>>> write the results back.
>>>>
>>>> In the Spark/Scala/Java code, you get an RDD of string, which we
>>>> convert back to a Dataframe.
>>>>
>>>> Feel free to ping me directly in case of questions.
>>>>
>>>> Thanks,
>>>> Jayant
>>>>
>>>>
>>>> On Thu, Jul 5, 2018 at 3:39 AM, Chetan Khatri <
>>>> chetan.opensou...@gmail.com> wrote:
>>>>
>>>>> Prem sure, Thanks for suggestion.
>>>>>
>>>>> On Wed, Jul 4, 2018 at 8:38 PM, Prem Sure 
>>>>> wrote:
>>>>>
>>>>>> try .pipe(.py) on RDD
>>>>>>
>>>>>> Thanks,
>>>>>> Prem
>>>>>>
>>>>>> On Wed, Jul 4, 2018 at 7:59 PM, Chetan Khatri <
>>>>>> chetan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> Can someone please suggest me , thanks
>>>>>>>
>>>>>>> On Tue 3 Jul, 2018, 5:28 PM Chetan Khatri, <
>>>>>>> chetan.opensou...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello Dear Spark User / Dev,
>>>>>>>>
>>>>>>>> I would like to pass Python user defined function to Spark Job
>>>>>>>> developed using Scala and return value of that function would be 
>>>>>>>> returned
>>>>>>>> to DF / Dataset API.
>>>>>>>>
>>>>>>>> Can someone please guide me, which would be best approach to do
>>>>>>>> this. Python function would be mostly transformation function. Also 
>>>>>>>> would
>>>>>>>> like to pass Java Function as a String to Spark / Scala job and it 
>>>>>>>> applies
>>>>>>>> to RDD / Data Frame and should return RDD / Data Frame.
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-15 Thread Chetan Khatri
Hello Jayant,

Thanks for great OSS Contribution :)

On Thu, Jul 12, 2018 at 1:36 PM, Jayant Shekhar 
wrote:

> Hello Chetan,
>
> Sorry missed replying earlier. You can find some sample code here :
>
> http://sparkflows.readthedocs.io/en/latest/user-guide/
> python/pipe-python.html
>
> We will continue adding more there.
>
> Feel free to ping me directly in case of questions.
>
> Thanks,
> Jayant
>
>
> On Mon, Jul 9, 2018 at 9:56 PM, Chetan Khatri  > wrote:
>
>> Hello Jayant,
>>
>> Thank you so much for suggestion. My view was to  use Python function as
>> transformation which can take couple of column names and return object.
>> which you explained. would that possible to point me to similiar codebase
>> example.
>>
>> Thanks.
>>
>> On Fri, Jul 6, 2018 at 2:56 AM, Jayant Shekhar 
>> wrote:
>>
>>> Hello Chetan,
>>>
>>> We have currently done it with .pipe(.py) as Prem suggested.
>>>
>>> That passes the RDD as CSV strings to the python script. The python
>>> script can either process it line by line, create the result and return it
>>> back. Or create things like Pandas Dataframe for processing and finally
>>> write the results back.
>>>
>>> In the Spark/Scala/Java code, you get an RDD of string, which we convert
>>> back to a Dataframe.
>>>
>>> Feel free to ping me directly in case of questions.
>>>
>>> Thanks,
>>> Jayant
>>>
>>>
>>> On Thu, Jul 5, 2018 at 3:39 AM, Chetan Khatri <
>>> chetan.opensou...@gmail.com> wrote:
>>>
>>>> Prem sure, Thanks for suggestion.
>>>>
>>>> On Wed, Jul 4, 2018 at 8:38 PM, Prem Sure 
>>>> wrote:
>>>>
>>>>> try .pipe(.py) on RDD
>>>>>
>>>>> Thanks,
>>>>> Prem
>>>>>
>>>>> On Wed, Jul 4, 2018 at 7:59 PM, Chetan Khatri <
>>>>> chetan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> Can someone please suggest me , thanks
>>>>>>
>>>>>> On Tue 3 Jul, 2018, 5:28 PM Chetan Khatri, <
>>>>>> chetan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello Dear Spark User / Dev,
>>>>>>>
>>>>>>> I would like to pass Python user defined function to Spark Job
>>>>>>> developed using Scala and return value of that function would be 
>>>>>>> returned
>>>>>>> to DF / Dataset API.
>>>>>>>
>>>>>>> Can someone please guide me, which would be best approach to do
>>>>>>> this. Python function would be mostly transformation function. Also 
>>>>>>> would
>>>>>>> like to pass Java Function as a String to Spark / Scala job and it 
>>>>>>> applies
>>>>>>> to RDD / Data Frame and should return RDD / Data Frame.
>>>>>>>
>>>>>>> Thank you.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-12 Thread Jayant Shekhar
Hello Chetan,

Sorry missed replying earlier. You can find some sample code here :

http://sparkflows.readthedocs.io/en/latest/user-guide/python/pipe-python.html

We will continue adding more there.

Feel free to ping me directly in case of questions.

Thanks,
Jayant


On Mon, Jul 9, 2018 at 9:56 PM, Chetan Khatri 
wrote:

> Hello Jayant,
>
> Thank you so much for suggestion. My view was to  use Python function as
> transformation which can take couple of column names and return object.
> which you explained. would that possible to point me to similiar codebase
> example.
>
> Thanks.
>
> On Fri, Jul 6, 2018 at 2:56 AM, Jayant Shekhar 
> wrote:
>
>> Hello Chetan,
>>
>> We have currently done it with .pipe(.py) as Prem suggested.
>>
>> That passes the RDD as CSV strings to the python script. The python
>> script can either process it line by line, create the result and return it
>> back. Or create things like Pandas Dataframe for processing and finally
>> write the results back.
>>
>> In the Spark/Scala/Java code, you get an RDD of string, which we convert
>> back to a Dataframe.
>>
>> Feel free to ping me directly in case of questions.
>>
>> Thanks,
>> Jayant
>>
>>
>> On Thu, Jul 5, 2018 at 3:39 AM, Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Prem sure, Thanks for suggestion.
>>>
>>> On Wed, Jul 4, 2018 at 8:38 PM, Prem Sure 
>>> wrote:
>>>
>>>> try .pipe(.py) on RDD
>>>>
>>>> Thanks,
>>>> Prem
>>>>
>>>> On Wed, Jul 4, 2018 at 7:59 PM, Chetan Khatri <
>>>> chetan.opensou...@gmail.com> wrote:
>>>>
>>>>> Can someone please suggest me , thanks
>>>>>
>>>>> On Tue 3 Jul, 2018, 5:28 PM Chetan Khatri, <
>>>>> chetan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> Hello Dear Spark User / Dev,
>>>>>>
>>>>>> I would like to pass Python user defined function to Spark Job
>>>>>> developed using Scala and return value of that function would be returned
>>>>>> to DF / Dataset API.
>>>>>>
>>>>>> Can someone please guide me, which would be best approach to do this.
>>>>>> Python function would be mostly transformation function. Also would like 
>>>>>> to
>>>>>> pass Java Function as a String to Spark / Scala job and it applies to 
>>>>>> RDD /
>>>>>> Data Frame and should return RDD / Data Frame.
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>>
>


Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-09 Thread Chetan Khatri
Hello Jayant,

Thank you so much for the suggestion. My idea was to use a Python function
as a transformation which can take a couple of column names and return an
object, which is what you explained. Would it be possible to point me to a
similar codebase example?

Thanks.

On Fri, Jul 6, 2018 at 2:56 AM, Jayant Shekhar 
wrote:

> Hello Chetan,
>
> We have currently done it with .pipe(.py) as Prem suggested.
>
> That passes the RDD as CSV strings to the python script. The python script
> can either process it line by line, create the result and return it back.
> Or create things like Pandas Dataframe for processing and finally write the
> results back.
>
> In the Spark/Scala/Java code, you get an RDD of string, which we convert
> back to a Dataframe.
>
> Feel free to ping me directly in case of questions.
>
> Thanks,
> Jayant
>
>
> On Thu, Jul 5, 2018 at 3:39 AM, Chetan Khatri  > wrote:
>
>> Prem sure, Thanks for suggestion.
>>
>> On Wed, Jul 4, 2018 at 8:38 PM, Prem Sure  wrote:
>>
>>> try .pipe(.py) on RDD
>>>
>>> Thanks,
>>> Prem
>>>
>>> On Wed, Jul 4, 2018 at 7:59 PM, Chetan Khatri <
>>> chetan.opensou...@gmail.com> wrote:
>>>
>>>> Can someone please suggest me , thanks
>>>>
>>>> On Tue 3 Jul, 2018, 5:28 PM Chetan Khatri, 
>>>> wrote:
>>>>
>>>>> Hello Dear Spark User / Dev,
>>>>>
>>>>> I would like to pass Python user defined function to Spark Job
>>>>> developed using Scala and return value of that function would be returned
>>>>> to DF / Dataset API.
>>>>>
>>>>> Can someone please guide me, which would be best approach to do this.
>>>>> Python function would be mostly transformation function. Also would like 
>>>>> to
>>>>> pass Java Function as a String to Spark / Scala job and it applies to RDD 
>>>>> /
>>>>> Data Frame and should return RDD / Data Frame.
>>>>>
>>>>> Thank you.
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>
>


Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-05 Thread Jayant Shekhar
Hello Chetan,

We have currently done it with .pipe(.py) as Prem suggested.

That passes the RDD as CSV strings to the python script. The python script
can either process it line by line, create the result and return it back.
Or create things like Pandas Dataframe for processing and finally write the
results back.

In the Spark/Scala/Java code, you get an RDD of string, which we convert
back to a Dataframe.
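
A minimal sketch of that flow, assuming an input DataFrame df, a script
transform.py present on every executor, and a known output schema; every
name here is illustrative rather than taken from the thread:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val asCsv = df.rdd.map(_.mkString(","))         // each Row becomes one CSV line
val piped = asCsv.pipe("python transform.py")   // script reads stdin, writes stdout

// Turn the RDD[String] coming back from the script into a DataFrame again.
val outSchema = StructType(Seq(
  StructField("a", StringType),
  StructField("b", StringType)))

val result = spark.createDataFrame(
  piped.map(line => Row.fromSeq(line.split(",", -1).toSeq)),
  outSchema)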

Feel free to ping me directly in case of questions.

Thanks,
Jayant


On Thu, Jul 5, 2018 at 3:39 AM, Chetan Khatri 
wrote:

> Prem sure, Thanks for suggestion.
>
> On Wed, Jul 4, 2018 at 8:38 PM, Prem Sure  wrote:
>
>> try .pipe(.py) on RDD
>>
>> Thanks,
>> Prem
>>
>> On Wed, Jul 4, 2018 at 7:59 PM, Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Can someone please suggest me , thanks
>>>
>>> On Tue 3 Jul, 2018, 5:28 PM Chetan Khatri, 
>>> wrote:
>>>
>>>> Hello Dear Spark User / Dev,
>>>>
>>>> I would like to pass Python user defined function to Spark Job
>>>> developed using Scala and return value of that function would be returned
>>>> to DF / Dataset API.
>>>>
>>>> Can someone please guide me, which would be best approach to do this.
>>>> Python function would be mostly transformation function. Also would like to
>>>> pass Java Function as a String to Spark / Scala job and it applies to RDD /
>>>> Data Frame and should return RDD / Data Frame.
>>>>
>>>> Thank you.
>>>>
>>>>
>>>>
>>>>
>>
>


Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-05 Thread Chetan Khatri
Prem sure, Thanks for suggestion.

On Wed, Jul 4, 2018 at 8:38 PM, Prem Sure  wrote:

> try .pipe(.py) on RDD
>
> Thanks,
> Prem
>
> On Wed, Jul 4, 2018 at 7:59 PM, Chetan Khatri  > wrote:
>
>> Can someone please suggest me , thanks
>>
>> On Tue 3 Jul, 2018, 5:28 PM Chetan Khatri, 
>> wrote:
>>
>>> Hello Dear Spark User / Dev,
>>>
>>> I would like to pass Python user defined function to Spark Job developed
>>> using Scala and return value of that function would be returned to DF /
>>> Dataset API.
>>>
>>> Can someone please guide me, which would be best approach to do this.
>>> Python function would be mostly transformation function. Also would like to
>>> pass Java Function as a String to Spark / Scala job and it applies to RDD /
>>> Data Frame and should return RDD / Data Frame.
>>>
>>> Thank you.
>>>
>>>
>>>
>>>
>


Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-04 Thread Prem Sure
try .pipe(.py) on RDD

Thanks,
Prem

On Wed, Jul 4, 2018 at 7:59 PM, Chetan Khatri 
wrote:

> Can someone please suggest me , thanks
>
> On Tue 3 Jul, 2018, 5:28 PM Chetan Khatri, 
> wrote:
>
>> Hello Dear Spark User / Dev,
>>
>> I would like to pass Python user defined function to Spark Job developed
>> using Scala and return value of that function would be returned to DF /
>> Dataset API.
>>
>> Can someone please guide me, which would be best approach to do this.
>> Python function would be mostly transformation function. Also would like to
>> pass Java Function as a String to Spark / Scala job and it applies to RDD /
>> Data Frame and should return RDD / Data Frame.
>>
>> Thank you.
>>
>>
>>
>>


Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-04 Thread Chetan Khatri
Can someone please suggest me , thanks

On Tue 3 Jul, 2018, 5:28 PM Chetan Khatri, 
wrote:

> Hello Dear Spark User / Dev,
>
> I would like to pass Python user defined function to Spark Job developed
> using Scala and return value of that function would be returned to DF /
> Dataset API.
>
> Can someone please guide me, which would be best approach to do this.
> Python function would be mostly transformation function. Also would like to
> pass Java Function as a String to Spark / Scala job and it applies to RDD /
> Data Frame and should return RDD / Data Frame.
>
> Thank you.
>
>
>
>


Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-03 Thread Chetan Khatri
Hello Dear Spark User / Dev,

I would like to pass a Python user-defined function to a Spark job developed
in Scala, and have the return value of that function come back through the
DataFrame / Dataset API.

Can someone please guide me on the best approach to do this? The Python
function would mostly be a transformation function. I would also like to
pass a Java function as a string to the Spark/Scala job so that it applies
to an RDD / DataFrame and returns an RDD / DataFrame.

Thank you.


Re: Spark with Scala 2.12

2018-04-21 Thread Mark Hamstra
Even more to the point:
http://apache-spark-developers-list.1001551.n3.nabble.com/Scala-2-12-support-td23833.html

tldr; It's an item of discussion, but there is no imminent release of Spark
that will use Scala 2.12.

On Sat, Apr 21, 2018 at 2:44 AM, purijatin  wrote:

> I see a discussion post on the dev mailing list:
> http://apache-spark-developers-list.1001551.n3.nabble.com/time-for-Apache-
> Spark-3-0-td23755.html#a23830
>
> Thanks
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark with Scala 2.12

2018-04-21 Thread purijatin
I see a discussion post on the dev mailing list:
http://apache-spark-developers-list.1001551.n3.nabble.com/time-for-Apache-Spark-3-0-td23755.html#a23830

Thanks



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark with Scala 2.12

2018-04-20 Thread Jatin Puri
Hello.

I am wondering if there is any new update on upgrading Spark to Scala 2.12
(https://issues.apache.org/jira/browse/SPARK-14220), especially given that
Scala 2.13 is in the vicinity of a release.

I ask because there is no recent update on the JIRA and the related ticket.
Maybe someone is actively working on it and I am just not aware.

The fix looks like a difficult one. It would be great if there could be some
indication of the timeline, as that would help us plan better.

It also looks too non-trivial for someone like me to help out with during my
free time, hence I can only ask.

Thanks for all the great work.

Regards,
Jatin


CI/CD for spark and scala

2018-01-24 Thread Deepak Sharma
Hi All,
I just wanted to check if there are any best practices around using CI/CD
for Spark / Scala projects running on AWS Hadoop clusters.
If there are any specific tools, please do let me know.

-- 
Thanks
Deepak


Re: XML Parsing with Spark and SCala

2017-08-11 Thread Jörn Franke
Can you specify what "is not able to load" means and what the expected
results are?



> On 11. Aug 2017, at 09:30, Etisha Jain <eti...@infoobjects.com> wrote:
> 
> Hi
> 
> I want to do xml parsing with spark, but the data from the file is not able
> to load and the desired output is also not coming.
> I am attaching a file. Can anyone help me to do this
> 
> solvePuzzle1.scala
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n29053/solvePuzzle1.scala>
>   
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/XML-Parsing-with-Spark-and-SCala-tp29053.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



XML Parsing with Spark and SCala

2017-08-11 Thread Etisha Jain
Hi

I want to do XML parsing with Spark, but the data from the file does not
load and the desired output is not produced.
I am attaching a file. Can anyone help me with this?

solvePuzzle1.scala
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n29053/solvePuzzle1.scala>
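
A minimal read sketch, assuming the spark-xml package
(com.databricks:spark-xml) is on the classpath and that the repeating
element in the file is <record>; both the rowTag and the path are
placeholders:

// Each <record>...</record> element becomes one row; nested elements become
// struct/array columns inferred from the data.
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")
  .load("path/to/file.xml")

df.printSchema()
df.show(5, truncate = false)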
  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/XML-Parsing-with-Spark-and-SCala-tp29053.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Command to get a list of all persisted RDDs in the Spark 2.0 Scala shell

2017-06-01 Thread nancy henry
Hi Team,

Please let me know how to get a list of all persisted RDDs in the Spark 2.0
shell.
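
A minimal sketch for the shell, using the SparkContext that backs the
built-in `spark` session; getPersistentRDDs returns a Map[Int, RDD[_]] of
everything currently marked as persisted:

spark.sparkContext.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"id=$id  name=${rdd.name}  storage=${rdd.getStorageLevel}")
}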


Regards,
Nancy


Re: Spark 2.0 Scala 2.11 and Kafka 0.10 Scala 2.10

2017-02-08 Thread Cody Koeninger
Pretty sure there was no 0.10.0.2 release of apache kafka.  If that's
a hortonworks modified version you may get better results asking in a
hortonworks specific forum.  Scala version of kafka shouldn't be
relevant either way though.

On Wed, Feb 8, 2017 at 5:30 PM, u...@moosheimer.com <u...@moosheimer.com> wrote:
> Dear devs,
>
> is it possible to use Spark 2.0.2 Scala 2.11 and consume messages from kafka
> server 0.10.0.2 running on Scala 2.10?
> I tried this the last two days by using createDirectStream and can't get no
> message out of kafka?!
>
> I'm using HDP 2.5.3 running kafka_2.10-0.10.0.2.5.3.0-37 and Spark 2.0.2.
>
> Uwe
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark 2.0 Scala 2.11 and Kafka 0.10 Scala 2.10

2017-02-08 Thread u...@moosheimer.com
Dear devs,

Is it possible to use Spark 2.0.2 with Scala 2.11 and consume messages from
a Kafka 0.10.0.2 server running on Scala 2.10?
I have tried this for the last two days using createDirectStream and cannot
get any messages out of Kafka.

I'm using HDP 2.5.3 running kafka_2.10-0.10.0.2.5.3.0-37 and Spark 2.0.2.

Uwe



Re: Assembly for Kafka >= 0.10.0, Spark 2.2.0, Scala 2.11

2017-01-18 Thread Cody Koeninger
Spark 2.2 hasn't been released yet, has it?

Python support in Kafka DStreams for 0.10 will probably never happen;
there's a JIRA ticket about this.

As for "stable", it's hard to say. It was quite a few releases before 0.8
was marked stable, even though it underwent little change.

On Wed, Jan 18, 2017 at 2:21 AM, Karamba <phantom...@web.de> wrote:
> |Hi, I am looking for an assembly for Spark 2.2.0 with Scala 2.11. I
> can't find one in MVN Repository. Moreover, "org.apache.spark" %%
> "spark-streaming-kafka-0-10_2.11" % "2.1.0 shows that even sbt does not
> find one: [error] (*:update) sbt.ResolveException: unresolved
> dependency: org.apache.spark#spark-streaming-kafka-0-10_2.11_2.11;2.1.0:
> not found Where do I find that a library? Thanks and best regards,
> karamba PS: Does anybody know when python support becomes available in
> spark-streaming-kafka-0-10 and when it will reach "stable"? |
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Assembly for Kafka >= 0.10.0, Spark 2.2.0, Scala 2.11

2017-01-18 Thread Karamba
Hi, I am looking for an assembly for Spark 2.2.0 with Scala 2.11. I can't
find one in the MVN Repository. Moreover, "org.apache.spark" %%
"spark-streaming-kafka-0-10_2.11" % "2.1.0" shows that even sbt does not
find one:

[error] (*:update) sbt.ResolveException: unresolved dependency:
org.apache.spark#spark-streaming-kafka-0-10_2.11_2.11;2.1.0: not found

Where do I find such a library? Thanks and best regards,
karamba

PS: Does anybody know when Python support becomes available in
spark-streaming-kafka-0-10 and when it will reach "stable"?
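
A side note on the sbt error above: the unresolved artifact name ends in
_2.11_2.11, which suggests %% was combined with an artifact id that already
carries the Scala suffix. A sketch of the two usual spellings, keeping the
version from the original message:

// Either let %% append the Scala suffix ...
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0"

// ... or write the suffix out yourself and use a single %
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.1.0"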


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Spark SQL - Scala] TestHive not working in Spark 2

2017-01-13 Thread Xin Wu
In terms of the NullPointerException, I think it is a bug: the test data
directories might have been moved already, so it failed to load the test
data needed to create the test tables. You may create a JIRA for this.

On Fri, Jan 13, 2017 at 11:44 AM, Xin Wu <xwu0...@gmail.com> wrote:

> If you are using spark-shell, you have instance "sc" as the SparkContext
> initialized already. If you are writing your own application, you need to
> create a SparkSession, which comes with the SparkContext. So you can
> reference it like sparkSession.sparkContext.
>
> In terms of creating a table from DataFrame, do you intend to create it
> via TestHive? or just want to create a Hive serde table for the DataFrame?
>
> On Fri, Jan 13, 2017 at 10:23 AM, Nicolas Tallineau <
> nicolas.tallin...@ubisoft.com> wrote:
>
>> But it forces you to create your own SparkContext, which I’d rather not
>> do.
>>
>>
>>
>> Also it doesn’t seem to allow me to directly create a table from a
>> DataFrame, as follow:
>>
>>
>>
>> TestHive.createDataFrame[MyType](rows).write.saveAsTable("a_table")
>>
>>
>>
>> *From:* Xin Wu [mailto:xwu0...@gmail.com]
>> *Sent:* 13 janvier 2017 12:43
>> *To:* Nicolas Tallineau <nicolas.tallin...@ubisoft.com>
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: [Spark SQL - Scala] TestHive not working in Spark 2
>>
>>
>>
>> I used the following:
>>
>>
>> val testHive = new org.apache.spark.sql.hive.test.TestHiveContext(sc,
>> *false*)
>>
>> val hiveClient = testHive.sessionState.metadataHive
>> hiveClient.runSqlHive(“….”)
>>
>>
>>
>> On Fri, Jan 13, 2017 at 6:40 AM, Nicolas Tallineau <
>> nicolas.tallin...@ubisoft.com> wrote:
>>
>> I get a nullPointerException as soon as I try to execute a
>> TestHive.sql(...) statement since migrating to Spark 2 because it's trying
>> to load non existing "test tables". I couldn't find a way to switch to
>> false the loadTestTables variable.
>>
>>
>>
>> Caused by: sbt.ForkMain$ForkError: java.lang.NullPointerException: null
>>
>> at org.apache.spark.sql.hive.test
>> .TestHiveSparkSession.getHiveFile(TestHive.scala:190)
>>
>> at org.apache.spark.sql.hive.test
>> .TestHiveSparkSession.org$apache$spark$sql$hive$test$TestHiv
>> eSparkSession$$quoteHiveFile(TestHive.scala:196)
>>
>> at org.apache.spark.sql.hive.test
>> .TestHiveSparkSession.(TestHive.scala:234)
>>
>> at org.apache.spark.sql.hive.test
>> .TestHiveSparkSession.(TestHive.scala:122)
>>
>> at org.apache.spark.sql.hive.test
>> .TestHiveContext.(TestHive.scala:80)
>>
>> at org.apache.spark.sql.hive.test
>> .TestHive$.(TestHive.scala:47)
>>
>> at org.apache.spark.sql.hive.test
>> .TestHive$.(TestHive.scala)
>>
>>
>>
>> I’m using Spark 2.1.0 in this case.
>>
>>
>>
>> Am I missing something or should I create a bug in Jira?
>>
>>
>>
>>
>>
>> --
>>
>> Xin Wu
>> (650)392-9799 <(650)%20392-9799>
>>
>
>
>
> --
> Xin Wu
> (650)392-9799
>



-- 
Xin Wu
(650)392-9799


Re: [Spark SQL - Scala] TestHive not working in Spark 2

2017-01-13 Thread Xin Wu
If you are using spark-shell, you have instance "sc" as the SparkContext
initialized already. If you are writing your own application, you need to
create a SparkSession, which comes with the SparkContext. So you can
reference it like sparkSession.sparkContext.

In terms of creating a table from DataFrame, do you intend to create it via
TestHive? or just want to create a Hive serde table for the DataFrame?

On Fri, Jan 13, 2017 at 10:23 AM, Nicolas Tallineau <
nicolas.tallin...@ubisoft.com> wrote:

> But it forces you to create your own SparkContext, which I’d rather not do.
>
>
>
> Also it doesn’t seem to allow me to directly create a table from a
> DataFrame, as follow:
>
>
>
> TestHive.createDataFrame[MyType](rows).write.saveAsTable("a_table")
>
>
>
> *From:* Xin Wu [mailto:xwu0...@gmail.com]
> *Sent:* 13 janvier 2017 12:43
> *To:* Nicolas Tallineau <nicolas.tallin...@ubisoft.com>
> *Cc:* user@spark.apache.org
> *Subject:* Re: [Spark SQL - Scala] TestHive not working in Spark 2
>
>
>
> I used the following:
>
>
> val testHive = new org.apache.spark.sql.hive.test.TestHiveContext(sc,
> *false*)
>
> val hiveClient = testHive.sessionState.metadataHive
> hiveClient.runSqlHive(“….”)
>
>
>
> On Fri, Jan 13, 2017 at 6:40 AM, Nicolas Tallineau <
> nicolas.tallin...@ubisoft.com> wrote:
>
> I get a nullPointerException as soon as I try to execute a
> TestHive.sql(...) statement since migrating to Spark 2 because it's trying
> to load non existing "test tables". I couldn't find a way to switch to
> false the loadTestTables variable.
>
>
>
> Caused by: sbt.ForkMain$ForkError: java.lang.NullPointerException: null
>
> at org.apache.spark.sql.hive.test.TestHiveSparkSession.
> getHiveFile(TestHive.scala:190)
>
> at org.apache.spark.sql.hive.test.TestHiveSparkSession.org
> $apache$spark$sql$hive$test$TestHiveSparkSession$$
> quoteHiveFile(TestHive.scala:196)
>
> at org.apache.spark.sql.hive.test.TestHiveSparkSession.<
> init>(TestHive.scala:234)
>
> at org.apache.spark.sql.hive.test.TestHiveSparkSession.<
> init>(TestHive.scala:122)
>
> at org.apache.spark.sql.hive.test.TestHiveContext.(
> TestHive.scala:80)
>
> at org.apache.spark.sql.hive.test.TestHive$.(
> TestHive.scala:47)
>
> at org.apache.spark.sql.hive.test.TestHive$.(
> TestHive.scala)
>
>
>
> I’m using Spark 2.1.0 in this case.
>
>
>
> Am I missing something or should I create a bug in Jira?
>
>
>
>
>
> --
>
> Xin Wu
> (650)392-9799
>



-- 
Xin Wu
(650)392-9799


RE: [Spark SQL - Scala] TestHive not working in Spark 2

2017-01-13 Thread Nicolas Tallineau
But it forces you to create your own SparkContext, which I’d rather not do.

Also it doesn’t seem to allow me to directly create a table from a DataFrame, 
as follow:

TestHive.createDataFrame[MyType](rows).write.saveAsTable("a_table")

From: Xin Wu [mailto:xwu0...@gmail.com]
Sent: 13 janvier 2017 12:43
To: Nicolas Tallineau <nicolas.tallin...@ubisoft.com>
Cc: user@spark.apache.org
Subject: Re: [Spark SQL - Scala] TestHive not working in Spark 2

I used the following:

val testHive = new org.apache.spark.sql.hive.test.TestHiveContext(sc, false)
val hiveClient = testHive.sessionState.metadataHive
hiveClient.runSqlHive(“….”)

On Fri, Jan 13, 2017 at 6:40 AM, Nicolas Tallineau 
<nicolas.tallin...@ubisoft.com> wrote:
I get a nullPointerException as soon as I try to execute a TestHive.sql(...) 
statement since migrating to Spark 2 because it's trying to load non existing 
"test tables". I couldn't find a way to switch to false the loadTestTables 
variable.

Caused by: sbt.ForkMain$ForkError: java.lang.NullPointerException: null
at org.apache.spark.sql.hive.test.TestHiveSparkSession.getHiveFile(TestHive.scala:190)
at org.apache.spark.sql.hive.test.TestHiveSparkSession.org$apache$spark$sql$hive$test$TestHiveSparkSession$$quoteHiveFile(TestHive.scala:196)
at org.apache.spark.sql.hive.test.TestHiveSparkSession.<init>(TestHive.scala:234)
at org.apache.spark.sql.hive.test.TestHiveSparkSession.<init>(TestHive.scala:122)
at org.apache.spark.sql.hive.test.TestHiveContext.<init>(TestHive.scala:80)
at org.apache.spark.sql.hive.test.TestHive$.<init>(TestHive.scala:47)
at org.apache.spark.sql.hive.test.TestHive$.<clinit>(TestHive.scala)

I’m using Spark 2.1.0 in this case.

Am I missing something or should I create a bug in Jira?



--
Xin Wu
(650)392-9799


Re: [Spark SQL - Scala] TestHive not working in Spark 2

2017-01-13 Thread Xin Wu
I used the following:

val testHive = new org.apache.spark.sql.hive.test.TestHiveContext(sc,
*false*)
val hiveClient = testHive.sessionState.metadataHive
hiveClient.runSqlHive(“….”)



On Fri, Jan 13, 2017 at 6:40 AM, Nicolas Tallineau <
nicolas.tallin...@ubisoft.com> wrote:

> I get a nullPointerException as soon as I try to execute a
> TestHive.sql(...) statement since migrating to Spark 2 because it's trying
> to load non existing "test tables". I couldn't find a way to switch to
> false the loadTestTables variable.
>
>
>
> Caused by: sbt.ForkMain$ForkError: java.lang.NullPointerException: null
>
> at org.apache.spark.sql.hive.test.TestHiveSparkSession.
> getHiveFile(TestHive.scala:190)
>
> at org.apache.spark.sql.hive.test.TestHiveSparkSession.org
> $apache$spark$sql$hive$test$TestHiveSparkSession$$
> quoteHiveFile(TestHive.scala:196)
>
> at org.apache.spark.sql.hive.test.TestHiveSparkSession.<
> init>(TestHive.scala:234)
>
> at org.apache.spark.sql.hive.test.TestHiveSparkSession.<
> init>(TestHive.scala:122)
>
> at org.apache.spark.sql.hive.test.TestHiveContext.(
> TestHive.scala:80)
>
> at org.apache.spark.sql.hive.test.TestHive$.(
> TestHive.scala:47)
>
> at org.apache.spark.sql.hive.test.TestHive$.(
> TestHive.scala)
>
>
>
> I’m using Spark 2.1.0 in this case.
>
>
>
> Am I missing something or should I create a bug in Jira?
>



-- 
Xin Wu
(650)392-9799


[Spark SQL - Scala] TestHive not working in Spark 2

2017-01-13 Thread Nicolas Tallineau
I get a nullPointerException as soon as I try to execute a TestHive.sql(...) 
statement since migrating to Spark 2 because it's trying to load non existing 
"test tables". I couldn't find a way to switch to false the loadTestTables 
variable.

Caused by: sbt.ForkMain$ForkError: java.lang.NullPointerException: null
at org.apache.spark.sql.hive.test.TestHiveSparkSession.getHiveFile(TestHive.scala:190)
at org.apache.spark.sql.hive.test.TestHiveSparkSession.org$apache$spark$sql$hive$test$TestHiveSparkSession$$quoteHiveFile(TestHive.scala:196)
at org.apache.spark.sql.hive.test.TestHiveSparkSession.<init>(TestHive.scala:234)
at org.apache.spark.sql.hive.test.TestHiveSparkSession.<init>(TestHive.scala:122)
at org.apache.spark.sql.hive.test.TestHiveContext.<init>(TestHive.scala:80)
at org.apache.spark.sql.hive.test.TestHive$.<init>(TestHive.scala:47)
at org.apache.spark.sql.hive.test.TestHive$.<clinit>(TestHive.scala)

I'm using Spark 2.1.0 in this case.

Am I missing something or should I create a bug in Jira?


Pasting oddity with Spark 2.0 (scala)

2016-11-14 Thread jggg777
This one has stumped the group here, hoping to get some insight into why this
error is happening.

I'm going through the  Databricks DataFrames scala docs
<https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#04%20SQL,%20DataFrames%20%26%20Datasets/02%20Introduction%20to%20DataFrames%20-%20scala.html>
 
.  Halfway down is the "Flattening" code, which I've copied below:

*The original code*

>>>
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

implicit class DataFrameFlattener(df: DataFrame) {
  def flattenSchema: DataFrame = {
df.select(flatten(Nil, df.schema): _*)
  }
  
  protected def flatten(path: Seq[String], schema: DataType): Seq[Column] =
schema match {
case s: StructType => s.fields.flatMap(f => flatten(path :+ f.name,
f.dataType))
case other => col(path.map(n =>
s"`$n`").mkString(".")).as(path.mkString(".")) :: Nil
  }
}
>>>

*Pasting into spark-shell with right-click (or pasting into Zeppelin)*


On EMR using Spark 2.0.0 (this also happens on Spark 2.0.1), running
"spark-shell", I right click to paste in the code above.  Here are the
errors I get.  Note that I get the same errors when I paste into Zeppelin on
EMR.

>>>
scala> import org.apache.spark.sql._
import org.apache.spark.sql._

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala>

scala> implicit class DataFrameFlattener(df: DataFrame) {
 |   def flattenSchema: DataFrame = {
 | df.select(flatten(Nil, df.schema): _*)
 |   }
 |
 |   protected def flatten(path: Seq[String], schema: DataType):
Seq[Column] = schema match {
 | case s: StructType => s.fields.flatMap(f => flatten(path :+
f.name, f.dataType))
 | case other => col(path.map(n =>
s"`$n`").mkString(".")).as(path.mkString(".")) :: Nil
 |   }
 | }
:11: error: not found: type DataFrame
   implicit class DataFrameFlattener(df: DataFrame) {
 ^
:12: error: not found: type DataFrame
 def flattenSchema: DataFrame = {
^
:16: error: not found: type Column
 protected def flatten(path: Seq[String], schema: DataType):
Seq[Column] = schema match {
 ^
:16: error: not found: type DataType
 protected def flatten(path: Seq[String], schema: DataType):
Seq[Column] = schema match {
  ^
:17: error: not found: type StructType
   case s: StructType => s.fields.flatMap(f => flatten(path :+
f.name, f.dataType))
   ^
:18: error: not found: value col
   case other => col(path.map(n =>
s"`$n`").mkString(".")).as(path.mkString(".")) :: Nil
>>>

*Pasting using :paste in spark-shell*


However when I paste the same code into spark-shell using :paste, the code
succeeds.

>>>
scala> :paste
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

implicit class DataFrameFlattener(df: DataFrame) {
  def flattenSchema: DataFrame = {
df.select(flatten(Nil, df.schema): _*)
  }

  protected def flatten(path: Seq[String], schema: DataType): Seq[Column] =
schema match {
case s: StructType => s.fields.flatMap(f => flatten(path :+ f.name,
f.dataType))
case other => col(path.map(n =>
s"`$n`").mkString(".")).as(path.mkString(".")) :: Nil
  }
}

// Exiting paste mode, now interpreting.

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
defined class DataFrameFlattener
>>>


Any idea what's going on here, and how to get this code working in Zeppelin? 
One thing we've found is that providing the full paths for DataFrame,
StructType, etc (for example org.apache.spark.sql.DataFrame) does work, but
it's a painful workaround and we don't know why the imports don't seem to be
working as usual.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Pasting-oddity-with-Spark-2-0-scala-tp28071.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Real Time Recommendation Engines with Spark and Scala

2016-09-05 Thread Mich Talebzadeh
o Roman
>>>> [image: https://]about.me/alonso.isidoro.roman
>>>>
>>>> <https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>
>>>>
>>>> 2016-09-05 16:39 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:
>>>>
>>>>> Thank you Alonso,
>>>>>
>>>>> I looked at your project. Interesting
>>>>>
>>>>> As I see it this is what you are suggesting
>>>>>
>>>>>
>>>>>1. A kafka producer is going to ask periodically to Amazon in
>>>>>order to know what products based on my own ratings and i am going to
>>>>>introduced them into some kafka topic.
>>>>>2. A spark streaming process is going to read from that previous
>>>>>topic.
>>>>>3. Apply some machine learning algorithms (ALS, content based
>>>>>filtering colaborative filtering) on those datasets readed by the spark
>>>>>streaming process.
>>>>>4. Save results in a mongo or cassandra instance.
>>>>>5. Use play framework to create an websocket interface between the
>>>>>mongo instance and the visual interface.
>>>>>
>>>>>
>>>>> As I understand
>>>>>
>>>>> Point 1: A kafka producer is going to ask periodically to Amazon in
>>>>> order to know what products based on my own ratings .
>>>>>
>>>>>
>>>>>1. What do you mean "my own rating" here? You know the products.
>>>>>So what Amazon is going to provide by way of Kafka?
>>>>>2. Assuming that you have created topic specifically for this
>>>>>purpose then that topic is streamed into Kafka, some algorithms is 
>>>>> applied
>>>>>and results are saved in DB
>>>>>3. You have some dashboard that will fetch data (via ???) from the
>>>>>DB and I guess Zeppelin can do it here?
>>>>>
>>>>>
>>>>> Do you Have a DFD diagram for your design in case. Something like
>>>>> below (hope does not look pedantic, non intended).
>>>>>
>>>>> [image: Inline images 1]
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * 
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>> On 5 September 2016 at 15:08, Alonso Isidoro Roman <alons...@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Hi Mitch, i wrote few months ago a tiny project with this issue in
>>>>>> mind. The idea is to apply ALS algorithm in order to get some valid
>>>>>> recommendations from another users.
>>>>>>
>>>>>>
>>>>>> The url of the project
>>>>>> <https://github.com/alonsoir/awesome-recommendation-engine>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Alonso Isidoro Roman
>>>>>> [image: https://]about.me/alonso.isidoro.roman
>>>>>>
>>>>>> <https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>
>>>>>>
>>>>>> 2016-09-05 15:41 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com
>>>>>> >:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Has anyone done any work on Real time recommendation engines with
>>>>>>> Spark and Scala.
>>>>>>>
>>>>>>> I have seen few PPTs with Python but wanted to see if these have
>>>>>>> been done with Scala.
>>>>>>>
>>>>>>> I trust this question makes sense.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> p.s. My prime interest would be in Financial markets.
>>>>>>>
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> LinkedIn * 
>>>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>>> may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>> damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Real Time Recommendation Engines with Spark and Scala

2016-09-05 Thread Alonso Isidoro Roman
By the way, I would love to work on your project; it looks promising!



Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-09-05 16:57 GMT+02:00 Alonso Isidoro Roman <alons...@gmail.com>:

> Hi Mitch,
>
>
>1. What do you mean "my own rating" here? You know the products. So
>what Amazon is going to provide by way of Kafka?
>
> The idea was to embed the functionality of a kafka producer within a rest
> service in order i can invoke this logic with my a rating. I did not create
> such functionality because i started to make another things, i get bored,
> basically. I created some unix commands with this code, using sbt-pack.
>
>
>
>1. Assuming that you have created topic specifically for this purpose
>then that topic is streamed into Kafka, some algorithms is applied and
>results are saved in DB
>
> you got it!
>
>1. You have some dashboard that will fetch data (via ???) from the DB
>and I guess Zeppelin can do it here?
>
>
> No, not any dashboard yet. Maybe it is not a good idea to connect mongodb
> with a dashboard through a web socket. It probably works, for a proof of
> concept, but, in a real project? i don't know yet...
>
> You can see what i did to push data within a kafka topic in this scala
> class
> <https://github.com/alonsoir/awesome-recommendation-engine/blob/master/src/main/scala/example/producer/AmazonProducerExample.scala>,
> you have to invoke pack within the scala shell to create this unix command.
>
> Regards!
>
> Alonso
>
>
>
> Alonso Isidoro Roman
> [image: https://]about.me/alonso.isidoro.roman
>
> <https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>
>
> 2016-09-05 16:39 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:
>
>> Thank you Alonso,
>>
>> I looked at your project. Interesting
>>
>> As I see it this is what you are suggesting
>>
>>
>>1. A kafka producer is going to ask periodically to Amazon in order
>>to know what products based on my own ratings and i am going to introduced
>>them into some kafka topic.
>>2. A spark streaming process is going to read from that previous
>>topic.
>>3. Apply some machine learning algorithms (ALS, content based
>>filtering colaborative filtering) on those datasets readed by the spark
>>streaming process.
>>4. Save results in a mongo or cassandra instance.
>>5. Use play framework to create an websocket interface between the
>>mongo instance and the visual interface.
>>
>>
>> As I understand
>>
>> Point 1: A kafka producer is going to ask periodically to Amazon in order
>> to know what products based on my own ratings .
>>
>>
>>1. What do you mean "my own rating" here? You know the products. So
>>what Amazon is going to provide by way of Kafka?
>>2. Assuming that you have created topic specifically for this purpose
>>then that topic is streamed into Kafka, some algorithms is applied and
>>results are saved in DB
>>3. You have some dashboard that will fetch data (via ???) from the DB
>>and I guess Zeppelin can do it here?
>>
>>
>> Do you Have a DFD diagram for your design in case. Something like below
>> (hope does not look pedantic, non intended).
>>
>> [image: Inline images 1]
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 5 September 2016 at 15:08, Alonso Isidoro Roman <alons...@gmail.com>
>> wrote:
>>
>>> Hi Mitch, i wrote few months ago a tiny project with this issue in mind.
>>> The idea is to apply ALS algorithm in order to get some valid
>>> recommendations from another users.
>>>
>>>
>>> The url of

Re: Real Time Recommendation Engines with Spark and Scala

2016-09-05 Thread Alonso Isidoro Roman
Hi Mitch,


   1. What do you mean "my own rating" here? You know the products. So what
   Amazon is going to provide by way of Kafka?

The idea was to embed the functionality of a Kafka producer within a REST
service so that I can invoke this logic with my own rating. I did not create
that functionality because I started working on other things and got bored,
basically. I created some Unix commands from this code using sbt-pack.



   1. Assuming that you have created topic specifically for this purpose
   then that topic is streamed into Kafka, some algorithms is applied and
   results are saved in DB

you got it!

   1. You have some dashboard that will fetch data (via ???) from the DB
   and I guess Zeppelin can do it here?


No, no dashboard yet. Maybe it is not a good idea to connect MongoDB to a
dashboard through a web socket. It probably works for a proof of concept,
but in a real project? I don't know yet...

You can see what I did to push data into a Kafka topic in this Scala class
<https://github.com/alonsoir/awesome-recommendation-engine/blob/master/src/main/scala/example/producer/AmazonProducerExample.scala>;
you have to invoke pack within the Scala shell to create this Unix command.

Regards!

Alonso



Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-09-05 16:39 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:

> Thank you Alonso,
>
> I looked at your project. Interesting
>
> As I see it this is what you are suggesting
>
>
>1. A kafka producer is going to ask periodically to Amazon in order to
>know what products based on my own ratings and i am going to introduced
>them into some kafka topic.
>2. A spark streaming process is going to read from that previous topic.
>3. Apply some machine learning algorithms (ALS, content based
>filtering colaborative filtering) on those datasets readed by the spark
>streaming process.
>4. Save results in a mongo or cassandra instance.
>5. Use play framework to create an websocket interface between the
>mongo instance and the visual interface.
>
>
> As I understand
>
> Point 1: A kafka producer is going to ask periodically to Amazon in order
> to know what products based on my own ratings .
>
>
>1. What do you mean "my own rating" here? You know the products. So
>what Amazon is going to provide by way of Kafka?
>2. Assuming that you have created topic specifically for this purpose
>then that topic is streamed into Kafka, some algorithms is applied and
>results are saved in DB
>3. You have some dashboard that will fetch data (via ???) from the DB
>and I guess Zeppelin can do it here?
>
>
> Do you Have a DFD diagram for your design in case. Something like below
> (hope does not look pedantic, non intended).
>
> [image: Inline images 1]
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 5 September 2016 at 15:08, Alonso Isidoro Roman <alons...@gmail.com>
> wrote:
>
>> Hi Mitch, i wrote few months ago a tiny project with this issue in mind.
>> The idea is to apply ALS algorithm in order to get some valid
>> recommendations from another users.
>>
>>
>> The url of the project
>> <https://github.com/alonsoir/awesome-recommendation-engine>
>>
>>
>>
>> Alonso Isidoro Roman
>> [image: https://]about.me/alonso.isidoro.roman
>>
>> <https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>
>>
>> 2016-09-05 15:41 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:
>>
>>> Hi,
>>>
>>> Has anyone done any work on Real time recommendation engines with Spark
>>> and Scala.
>>>
>>> I have seen few PPTs with Python but wanted to see if these have been
>>> done with Scala.
>>>
>>> I trust this question makes sense.
>>>
>>> Thanks
>>>
>>> p.s. My prime inter

Re: Real Time Recommendation Engines with Spark and Scala

2016-09-05 Thread Mich Talebzadeh
Thank you Alonso,

I looked at your project. Interesting

As I see it this is what you are suggesting


   1. A Kafka producer is going to ask Amazon periodically in order to know
   about the products behind my own ratings, and I am going to introduce
   them into some Kafka topic.
   2. A Spark Streaming process is going to read from that previous topic.
   3. Apply some machine learning algorithms (ALS, content-based filtering,
   collaborative filtering) on those datasets read by the Spark Streaming
   process.
   4. Save results in a Mongo or Cassandra instance.
   5. Use the Play framework to create a websocket interface between the
   Mongo instance and the visual interface.


As I understand

Point 1: A Kafka producer is going to ask Amazon periodically in order to
know about the products behind my own ratings.


   1. What do you mean by "my own rating" here? You know the products. So
   what is Amazon going to provide by way of Kafka?
   2. Assuming that you have created a topic specifically for this purpose,
   then that topic is streamed into Kafka, some algorithms are applied and
   results are saved in the DB
   3. You have some dashboard that will fetch data (via ???) from the DB
   and I guess Zeppelin can do it here?


Do you have a DFD diagram for your design, by any chance? Something like the
one below (I hope it does not look pedantic; none intended).

[image: Inline images 1]






Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 5 September 2016 at 15:08, Alonso Isidoro Roman <alons...@gmail.com>
wrote:

> Hi Mitch, i wrote few months ago a tiny project with this issue in mind.
> The idea is to apply ALS algorithm in order to get some valid
> recommendations from another users.
>
>
> The url of the project
> <https://github.com/alonsoir/awesome-recommendation-engine>
>
>
>
> Alonso Isidoro Roman
> [image: https://]about.me/alonso.isidoro.roman
>
> <https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>
>
> 2016-09-05 15:41 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:
>
>> Hi,
>>
>> Has anyone done any work on Real time recommendation engines with Spark
>> and Scala.
>>
>> I have seen few PPTs with Python but wanted to see if these have been
>> done with Scala.
>>
>> I trust this question makes sense.
>>
>> Thanks
>>
>> p.s. My prime interest would be in Financial markets.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>


Re: Real Time Recommendation Engines with Spark and Scala

2016-09-05 Thread Alonso Isidoro Roman
Hi Mitch, I wrote a tiny project a few months ago with this issue in mind.
The idea is to apply the ALS algorithm in order to get some valid
recommendations from other users.
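
For reference, a minimal sketch of the ALS step itself (illustrative only, not
the project's code; ratingLines stands for a hypothetical RDD[String] of
"user,product,rating" lines coming off the Kafka topic):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Parse the incoming lines into MLlib Rating objects
val ratings = ratingLines.map { line =>
  val Array(user, product, rating) = line.split(",")
  Rating(user.toInt, product.toInt, rating.toDouble)
}

// Train a collaborative filtering model (rank 10, 10 iterations, lambda 0.01
// are just illustrative values)
val model = ALS.train(ratings, 10, 10, 0.01)

// Top 5 product recommendations for a given user id
val topFive = model.recommendProducts(42, 5)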


The url of the project
<https://github.com/alonsoir/awesome-recommendation-engine>



Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2016-09-05 15:41 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:

> Hi,
>
> Has anyone done any work on Real time recommendation engines with Spark
> and Scala.
>
> I have seen few PPTs with Python but wanted to see if these have been done
> with Scala.
>
> I trust this question makes sense.
>
> Thanks
>
> p.s. My prime interest would be in Financial markets.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Real Time Recommendation Engines with Spark and Scala

2016-09-05 Thread Mich Talebzadeh
Hi,

Has anyone done any work on real-time recommendation engines with Spark and
Scala?

I have seen a few PPTs with Python but wanted to see if these have been done
with Scala.

I trust this question makes sense.

Thanks

p.s. My prime interest would be in Financial markets.


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: How to read *.jhist file in Spark using scala

2016-05-24 Thread Miles
Instead of reading *.jhist files directly in Spark, you could convert your
.jhist files into JSON and then read the JSON files in Spark.

Here's a post on converting .jhist file to json format.
http://stackoverflow.com/questions/32683907/converting-jhist-files-to-json-format
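
Once converted, reading the files back is a one-liner; a minimal sketch
(Spark 1.x API, with a hypothetical HDFS path for the converted files):

val jhistDF = sqlContext.read.json("hdfs:///jobhistory/converted/*.json")
jhistDF.printSchema()               // inspect the inferred schema
jhistDF.registerTempTable("jhist")  // then query it with Spark SQL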



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-jhist-file-in-Spark-using-scala-tp26972p27015.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark w/ scala 2.11 and PackratParsers

2016-05-04 Thread matd
Hi folks,

Our project is a mess of scala 2.10 and 2.11, so I tried to switch
everything to 2.11.

I had some exasperating errors like this :

java.lang.NoClassDefFoundError:
org/apache/spark/sql/execution/datasources/DDLParser
at org.apache.spark.sql.SQLContext.(SQLContext.scala:208)
at org.apache.spark.sql.SQLContext.(SQLContext.scala:77)
at org.apache.spark.sql.SQLContext$.getOrCreate(SQLContext.scala:1295)

... that I was unable to fix, until I figured out that this error came first
:

java.lang.NoClassDefFoundError: scala/util/parsing/combinator/PackratParsers
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

...that I finally managed to fix by adding this dependency:
"org.scala-lang.modules"  %% "scala-parser-combinators" % "1.0.4"

As this is not documented anywhere, I'd like to know if it's just a missing
doc somewhere, or if it's hiding another problem that will jump out at me at
some point?

Mathieu




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-w-scala-2-11-and-PackratParsers-tp26877.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[Ask :]Best Practices - Application logging in Spark 1.5.2 + Scala 2.10

2016-04-21 Thread Divya Gehlot
Hi,
I am using Spark with a Hadoop 2.7 cluster.
I need to write all my print statements and/or any errors to a file, for
instance some info if a certain level is passed, or some error if something
is missing in my Spark Scala script.

Can somebody help me or point me to a tutorial, blog or book?
What's the best way to achieve it?

Thanks in advance.

Divya


RE: pass one dataframe column value to another dataframe filter expression + Spark 1.5 + scala

2016-02-05 Thread Lohith Samaga M
Hi,
If you can also format the condition file as a csv file similar 
to the main file, then you can join the two dataframes and select only required 
columns.
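
A rough sketch of that idea (untested; it reuses dfcars and sqlContext from the
original post further down, and assumes the condition file has been reformatted
into a hypothetical TagId,year,model csv such as "1997_cars,1997,E350"):

import org.apache.spark.sql.types._

val carTagsFlatSchema = StructType(Seq(
  StructField("TagId", StringType, true),
  StructField("year", IntegerType, true),
  StructField("model", StringType, true)))

val dftagsFlat = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(carTagsFlatSchema)
  .load("/TestDivya/Spark/CarTagsFlat.csv")   // hypothetical reformatted file

// Left outer join on the columns the conditions actually use, then keep the
// original columns plus the TagId
val tagged = dfcars.join(dftagsFlat,
    dfcars("year") === dftagsFlat("year") && dfcars("model") === dftagsFlat("model"),
    "left_outer")
  .select(dfcars("year"), dfcars("make"), dfcars("model"),
    dfcars("comment"), dfcars("blank"), dftagsFlat("TagId"))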

Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga

From: Divya Gehlot [mailto:divya.htco...@gmail.com]
Sent: Friday, February 05, 2016 13.12
To: user @spark
Subject: pass one dataframe column value to another dataframe filter expression 
+ Spark 1.5 + scala

Hi,
I have two input datasets
First input dataset like as below :

year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they are going fast",
2015,Chevy,Volt

Second Input dataset :

TagId,condition
1997_cars,year = 1997 and model = 'E350'
2012_cars,year=2012 and model ='S'
2015_cars ,year=2015 and model = 'Volt'

Now my requirement is to read the first data set and, based on the filtering
condition in the second dataset, tag the rows of the first input dataset by
introducing a new column TagId to the first input data set,
so the expected output should look like:

year,make,model,comment,blank,TagId
"2012","Tesla","S","No comment",2012_cars
1997,Ford,E350,"Go get one now they are going fast",1997_cars
2015,Chevy,Volt, ,2015_cars

I tried like :

val sqlContext = new SQLContext(sc)
val carsSchema = StructType(Seq(
StructField("year", IntegerType, true),
StructField("make", StringType, true),
StructField("model", StringType, true),
StructField("comment", StringType, true),
StructField("blank", StringType, true)))

val carTagsSchema = StructType(Seq(
StructField("TagId", StringType, true),
StructField("condition", StringType, true)))


val dfcars = 
sqlContext.read.format("com.databricks.spark.csv").option("header", "true") 
.schema(carsSchema).load("/TestDivya/Spark/cars.csv")
val dftags = 
sqlContext.read.format("com.databricks.spark.csv").option("header", "true") 
.schema(carTagsSchema).load("/TestDivya/Spark/CarTags.csv")

val Amendeddf = dfcars.withColumn("TagId", dfcars("blank"))
val cdtnval = dftags.select("condition")
val df2=dfcars.filter(cdtnval)
:35: error: overloaded method value filter with alternatives:
  (conditionExpr: String)org.apache.spark.sql.DataFrame 
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
 cannot be applied to (org.apache.spark.sql.DataFrame)
   val df2=dfcars.filter(cdtnval)

another way :

val col = dftags.col("TagId")
val finaldf = dfcars.withColumn("TagId", col)
org.apache.spark.sql.AnalysisException: resolved attribute(s) TagId#5 
missing from comment#3,blank#4,model#2,make#1,year#0 in operator !Project 
[year#0,make#1,model#2,comment#3,blank#4,TagId#5 AS TagId#8];

finaldf.write.format("com.databricks.spark.csv").option("header", 
"true").save("/TestDivya/Spark/carswithtags.csv")


I would really appreciate it if somebody could give me pointers on how I can
pass the filter condition (second dataframe) to the filter function of the
first dataframe, or suggest another solution.
My apologies for such a naive question, as I am new to Scala and Spark.

Thanks


Re: pass one dataframe column value to another dataframe filter expression + Spark 1.5 + scala

2016-02-05 Thread Ali Tajeldin
I think the tricky part here is that the join condition is encoded in the 
second data frame and not a direct value.

Assuming the second data frame (the tags) is small enough, you can collect it
(read it into memory) and then construct a "when" expression chain for each of
the possible tags; then all you need to do is select on data frame 1 (df1) (or
use "withColumn" to add the column) and provide the constructed "when" chain as
the new column value.
From a very high level, looking at your example data below, you would construct
the following expression from df2 (of course, in the real case, you would
construct the expression programmatically from the collected df2 data rather
than hardcode it).

val e = when(expr("year = 1997 and model = 'E350'"), "1997_cars").
...
   when(expr("year = 2015 and model = 'Volt'"), "2015_cars").
   otherwise("unknown")

then you just need to add the new column to your input as:
df1.withColumn("tag", e)
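
For what it's worth, a rough sketch of the programmatic construction
(untested, Spark 1.5-style; it assumes dftags has exactly the two columns
TagId and condition from the original post):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{expr, when, lit}

// Bring the small tags frame to the driver and fold it into one nested
// when(...).otherwise(...) column expression.
val tagRows = dftags.collect()
val tagCol: Column = tagRows.foldLeft(lit("unknown"): Column) { (acc, row) =>
  when(expr(row.getString(1)), row.getString(0)).otherwise(acc)
}

val tagged = dfcars.withColumn("TagId", tagCol)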

I caveat this by saying that I have not tried the above (especially using
"expr" to evaluate partial SQL expressions), but it should work according to
the docs.
Sometimes, half the battle is just finding the right API.  The
"when"/"otherwise" functions are documented under the "Column" class and
"withColumn"/"collect" are documented under "DataFrame".

--
Ali

 
On Feb 5, 2016, at 1:56 AM, Lohith Samaga M <lohith.sam...@mphasis.com> wrote:

> Hi,
> If you can also format the condition file as a csv file 
> similar to the main file, then you can join the two dataframes and select 
> only required columns.
>  
> Best regards / Mit freundlichen Grüßen / Sincères salutations
> M. Lohith Samaga
>  
> From: Divya Gehlot [mailto:divya.htco...@gmail.com] 
> Sent: Friday, February 05, 2016 13.12
> To: user @spark
> Subject: pass one dataframe column value to another dataframe filter 
> expression + Spark 1.5 + scala
>  
> Hi,
> I have two input datasets
> First input dataset like as below : 
>  
> year,make,model,comment,blank
> "2012","Tesla","S","No comment",
> 1997,Ford,E350,"Go get one now they are going fast",
> 2015,Chevy,Volt
>  
> Second Input dataset :
>  
> TagId,condition
> 1997_cars,year = 1997 and model = 'E350'
> 2012_cars,year=2012 and model ='S'
> 2015_cars ,year=2015 and model = 'Volt'
>  
> Now my requirement is read first data set and based on the filtering 
> condition in second dataset need to tag rows of first input dataset by 
> introducing a new column TagId to first input data set 
> so the expected should look like :
>  
> year,make,model,comment,blank,TagId
> "2012","Tesla","S","No comment",2012_cars
> 1997,Ford,E350,"Go get one now they are going fast",1997_cars
> 2015,Chevy,Volt, ,2015_cars
>  
> I tried like :
>  
> val sqlContext = new SQLContext(sc)
> val carsSchema = StructType(Seq(
> StructField("year", IntegerType, true),
> StructField("make", StringType, true),
> StructField("model", StringType, true),
> StructField("comment", StringType, true),
> StructField("blank", StringType, true)))
> 
> val carTagsSchema = StructType(Seq(
> StructField("TagId", StringType, true),
> StructField("condition", StringType, true)))
> 
> 
> val dfcars = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", "true") 
> .schema(carsSchema).load("/TestDivya/Spark/cars.csv")
> val dftags = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", "true") 
> .schema(carTagsSchema).load("/TestDivya/Spark/CarTags.csv")
> 
> val Amendeddf = dfcars.withColumn("TagId", dfcars("blank"))
> val cdtnval = dftags.select("condition")
> val df2=dfcars.filter(cdtnval)
> :35: error: overloaded method value filter with alternatives:
>   (conditionExpr: String)org.apache.spark.sql.DataFrame 
>   (condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
>  cannot be applied to (org.apache.spark.sql.DataFrame)
>val df2=dfcars.filter(cdtnval)
>  
> another way :
>  
> val col = dftags.col("TagId")
> val finaldf = dfcars.withColumn("TagId", col)
> org.apache.spark.sql.AnalysisException: resolved attribute(s) TagId#5 
> missing from comm

pass one dataframe column value to another dataframe filter expression + Spark 1.5 + scala

2016-02-04 Thread Divya Gehlot
Hi,
I have two input datasets
First input dataset like as below :

year,make,model,comment,blank
> "2012","Tesla","S","No comment",
> 1997,Ford,E350,"Go get one now they are going fast",
> 2015,Chevy,Volt


Second Input dataset :

TagId,condition
> 1997_cars,year = 1997 and model = 'E350'
> 2012_cars,year=2012 and model ='S'
> 2015_cars ,year=2015 and model = 'Volt'


Now my requirement is to read the first data set and, based on the filtering
condition in the second dataset, tag the rows of the first input dataset by
introducing a new column TagId to the first input data set,
so the expected output should look like:

year,make,model,comment,blank,TagId
> "2012","Tesla","S","No comment",2012_cars
> 1997,Ford,E350,"Go get one now they are going fast",1997_cars
> 2015,Chevy,Volt, ,2015_cars


I tried like :

val sqlContext = new SQLContext(sc)
> val carsSchema = StructType(Seq(
> StructField("year", IntegerType, true),
> StructField("make", StringType, true),
> StructField("model", StringType, true),
> StructField("comment", StringType, true),
> StructField("blank", StringType, true)))
>
> val carTagsSchema = StructType(Seq(
> StructField("TagId", StringType, true),
> StructField("condition", StringType, true)))
>
>
> val dfcars =
> sqlContext.read.format("com.databricks.spark.csv").option("header", "true")
> .schema(carsSchema).load("/TestDivya/Spark/cars.csv")
> val dftags =
> sqlContext.read.format("com.databricks.spark.csv").option("header", "true")
> .schema(carTagsSchema).load("/TestDivya/Spark/CarTags.csv")
>
> val Amendeddf = dfcars.withColumn("TagId", dfcars("blank"))
> val cdtnval = dftags.select("condition")
> val df2=dfcars.filter(cdtnval)
> :35: error: overloaded method value filter with alternatives:
>   (conditionExpr: String)org.apache.spark.sql.DataFrame 
>   (condition:
> org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
>  cannot be applied to (org.apache.spark.sql.DataFrame)
>val df2=dfcars.filter(cdtnval)


another way :

val col = dftags.col("TagId")
> val finaldf = dfcars.withColumn("TagId", col)
> org.apache.spark.sql.AnalysisException: resolved attribute(s) TagId#5
> missing from comment#3,blank#4,model#2,make#1,year#0 in operator !Project
> [year#0,make#1,model#2,comment#3,blank#4,TagId#5 AS TagId#8];
>
> finaldf.write.format("com.databricks.spark.csv").option("header",
> "true").save("/TestDivya/Spark/carswithtags.csv")



I would really appreciate it if somebody could give me pointers on how I can
pass the filter condition (second dataframe) to the filter function of the
first dataframe, or suggest another solution.
My apologies for such a naive question, as I am new to Scala and Spark.

Thanks


Optimized way to multiply two large matrices and save output using Spark and Scala

2016-01-13 Thread Devi P.V
I want to multiply two large matrices (from csv files) using Spark and Scala
and save the output. I use the following code:

  val rows=file1.coalesce(1,false).map(x=>{
  val line=x.split(delimiter).map(_.toDouble)
  Vectors.sparse(line.length,
line.zipWithIndex.map(e => (e._2, e._1)).filter(_._2 != 0.0))

})

val rmat = new RowMatrix(rows)

val dm=file2.coalesce(1,false).map(x=>{
  val line=x.split(delimiter).map(_.toDouble)
  Vectors.dense(line)
})

val ma = dm.map(_.toArray).take(dm.count.toInt)
val localMat = Matrices.dense( dm.count.toInt,
  dm.take(1)(0).size,

  transpose(ma).flatten)

// Multiply two matrices
val s=rmat.multiply(localMat).rows

s.map(x=>x.toArray.mkString(delimiter)).saveAsTextFile(OutputPath)

  }

  def transpose(m: Array[Array[Double]]): Array[Array[Double]] = {
(for {
  c <- m(0).indices
} yield m.map(_(c)) ).toArray
  }

When I save the file it takes a lot of time and the output file is very large.
What is the optimized way to multiply two large files and save the output to a
text file?


Re: Optimized way to multiply two large matrices and save output using Spark and Scala

2016-01-13 Thread Burak Yavuz
BlockMatrix.multiply is the suggested method of multiplying two large
matrices. Is there a reason that you didn't use BlockMatrices?

You can load the matrices and convert to and from RowMatrix. If it's in
sparse format (i, j, v), then you can also use the CoordinateMatrix to
load, BlockMatrix to multiply, and CoordinateMatrix to save it back again.
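
A rough sketch of that route (untested; it assumes the inputs can be parsed as
"i,j,value" triplets, so adapt the parsing to the real csv layout, and it
reuses file1, file2, delimiter and OutputPath from the code quoted below):

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

def toEntries(lines: RDD[String]) = lines.map { line =>
  val Array(i, j, v) = line.split(delimiter)
  MatrixEntry(i.toLong, j.toLong, v.toDouble)
}

val matA = new CoordinateMatrix(toEntries(file1)).toBlockMatrix()
val matB = new CoordinateMatrix(toEntries(file2)).toBlockMatrix()

// Distributed block-wise multiplication
val product = matA.multiply(matB)

// Back to (i, j, value) triplets for saving as text
product.toCoordinateMatrix().entries
  .map(e => Seq(e.i, e.j, e.value).mkString(delimiter))
  .saveAsTextFile(OutputPath)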

Thanks,
Burak

On Wed, Jan 13, 2016 at 8:16 PM, Devi P.V <devip2...@gmail.com> wrote:

> I want to multiply two large matrices (from csv files)using Spark and
> Scala and save output.I use the following code
>
>   val rows=file1.coalesce(1,false).map(x=>{
>   val line=x.split(delimiter).map(_.toDouble)
>   Vectors.sparse(line.length,
> line.zipWithIndex.map(e => (e._2, e._1)).filter(_._2 != 0.0))
>
> })
>
> val rmat = new RowMatrix(rows)
>
> val dm=file2.coalesce(1,false).map(x=>{
>   val line=x.split(delimiter).map(_.toDouble)
>   Vectors.dense(line)
> })
>
> val ma = dm.map(_.toArray).take(dm.count.toInt)
> val localMat = Matrices.dense( dm.count.toInt,
>   dm.take(1)(0).size,
>
>   transpose(ma).flatten)
>
> // Multiply two matrices
> val s=rmat.multiply(localMat).rows
>
> s.map(x=>x.toArray.mkString(delimiter)).saveAsTextFile(OutputPath)
>
>   }
>
>   def transpose(m: Array[Array[Double]]): Array[Array[Double]] = {
> (for {
>   c <- m(0).indices
> } yield m.map(_(c)) ).toArray
>   }
>
> When I save file it takes more time and output file has very large in
> size.what is the optimized way to multiply two large files and save the
> output to a text file ?
>


Re: Pivot Data in Spark and Scala

2015-10-31 Thread ayan guha
(disclaimer: my reply in SO)

http://stackoverflow.com/questions/30260015/reshaping-pivoting-data-in-spark-rdd-and-or-spark-dataframes/30278605#30278605


On Sat, Oct 31, 2015 at 6:21 AM, Ali Tajeldin EDU <alitedu1...@gmail.com>
wrote:

> You can take a look at the smvPivot function in the SMV library (
> https://github.com/TresAmigosSD/SMV ).  Should look for method "smvPivot"
> in SmvDFHelper (
>
> http://tresamigossd.github.io/SMV/scaladocs/index.html#org.tresamigos.smv.SmvDFHelper).
> You can also perform the pivot on a group-by-group basis.  See smvPivot and
> smvPivotSum in SmvGroupedDataFunc (
> http://tresamigossd.github.io/SMV/scaladocs/index.html#org.tresamigos.smv.SmvGroupedDataFunc
> ).
>
> Docs from smvPivotSum are copied below.  Note that you don't have to
> specify the baseOutput columns, but if you don't, it will force an
> additional action on the input data frame to build the cross products of
> all possible values in your input pivot columns.
>
> Perform a normal SmvPivot operation followed by a sum on all the output
> pivot columns.
> For example:
>
> df.smvGroupBy("id").smvPivotSum(Seq("month", "product"))("count")("5_14_A", 
> "5_14_B", "6_14_A", "6_14_B")
>
> and the following input:
>
> Input
> | id  | month | product | count |
> | --- | - | --- | - |
> | 1   | 5/14  |   A |   100 |
> | 1   | 6/14  |   B |   200 |
> | 1   | 5/14  |   B |   300 |
>
> will produce the following output:
>
> | id  | count_5_14_A | count_5_14_B | count_6_14_A | count_6_14_B |
> | --- |  |  |  |  |
> | 1   | 100  | 300  | NULL | 200  |
>
> pivotCols
> The sequence of column names whose values will be used as the output pivot
> column names.
> valueCols
> The columns whose value will be copied to the pivoted output columns.
> baseOutput
> The expected base output column names (without the value column prefix).
> The user is required to supply the list of expected pivot column output
> names to avoid and extra action on the input DataFrame just to extract the
> possible pivot columns. if an empty sequence is provided, then the base
> output columns will be extracted from values in the pivot columns (will
> cause an action on the entire DataFrame!)
>
> --
> Ali
> PS: shoot me an email if you run into any issues using SMV.
>
>
> On Oct 30, 2015, at 6:33 AM, Andrianasolo Fanilo <
> fanilo.andrianas...@worldline.com> wrote:
>
> Hey,
>
> The question is tricky, here is a possible answer by defining years as
> keys for a hashmap per client and merging those :
>
>
> *import *scalaz._
> *import *Scalaz._
>
>
> *val *sc = *new *SparkContext(*"local[*]"*, *"sandbox"*)
>
>
> *// Create RDD of your objects**val *rdd = sc.parallelize(*Seq*(
>   (*"A"*, 2015, 4),
>   (*"A"*, 2014, 12),
>   (*"A"*, 2013, 1),
>   (*"B"*, 2015, 24),
>   (*"B"*, 2013, 4)
> ))
>
>
> *// Search for all the years in the RDD**val *minYear =
> rdd.map(_._2).reduce(Math.*min*)
> *// look for minimum year**val *maxYear = rdd.map(_._2).reduce(Math.*max*
> )
> *// look for maximum year**val *sequenceOfYears = maxYear to minYear by -1
>
>
>
> *// create sequence of years from max to min// Define functions to build,
> for each client, a Map of year -> value for year, and how those maps will
> be merged**def *createCombiner(obj: (Int, Int)): Map[Int, String] = 
> *Map*(obj._1
> -> obj._2.toString)
> *def *mergeValue(accum: Map[Int, String], obj: (Int, Int)) = accum +
> (obj._1 -> obj._2.toString)
> *def *mergeCombiners(accum1: Map[Int, String], accum2: Map[Int, String]) =
>  accum1 |+| accum2 *// I’m lazy so I use Scalaz to merge two maps of year
> -> value, I assume we don’t have two lines with same client and year…*
>
>
> *// For each client, check for each year from maxYear to minYear if it
> exists in the computed map. If not input blank.**val *result = rdd
>   .map { *case *obj => (obj._1, (obj._2, obj._3)) }
>   .combineByKey(createCombiner, mergeValue, mergeCombiners)
>   .map{ *case *(name, mapOfYearsToValues) => (*Seq*(name) ++
> sequenceOfYears.map(year => mapOfYearsToValues.getOrElse(year, *" "*
> ))).mkString(*","*)}* // here we assume that sequence of all years isn’t
> too big to not fit in memory. If you had to compute for each day, it may
> break and you would definitely need to use a specialized timeseries
> library…*
>
> result.foreach(*println*)
>
> sc.stop()
>
> Best regards,
>

Re: Pivot Data in Spark and Scala

2015-10-30 Thread Adrian Tanase
It's actually a bit tougher as you’ll first need all the years. Also not sure
how you would represent your “columns” given they are dynamic based on the
input data.

Depending on your downstream processing, I’d probably try to emulate it with a 
hash map with years as keys instead of the columns.

There is probably a nicer solution using the data frames API but I’m not 
familiar with it.

If you actually need vectors I think this article I saw recently on the
Databricks blog will highlight some options (look for gather encoder)
https://databricks.com/blog/2015/10/20/audience-modeling-with-spark-ml-pipelines.html

-adrian

From: Deng Ching-Mallete
Date: Friday, October 30, 2015 at 4:35 AM
To: Ascot Moss
Cc: User
Subject: Re: Pivot Data in Spark and Scala

Hi,

You could transform it into a pair RDD then use the combineByKey function.

HTH,
Deng

On Thu, Oct 29, 2015 at 7:29 PM, Ascot Moss 
<ascot.m...@gmail.com<mailto:ascot.m...@gmail.com>> wrote:
Hi,

I have data as follows:

A, 2015, 4
A, 2014, 12
A, 2013, 1
B, 2015, 24
B, 2013 4


I need to convert the data to a new format:
A ,4,12,1
B,   24,,4

Any idea how to make it in Spark Scala?

Thanks




RE: Pivot Data in Spark and Scala

2015-10-30 Thread Andrianasolo Fanilo
Hey,

The question is tricky, here is a possible answer by defining years as keys for 
a hashmap per client and merging those :


import scalaz._
import Scalaz._

val sc = new SparkContext("local[*]", "sandbox")

// Create RDD of your objects
val rdd = sc.parallelize(Seq(
  ("A", 2015, 4),
  ("A", 2014, 12),
  ("A", 2013, 1),
  ("B", 2015, 24),
  ("B", 2013, 4)
))

// Search for all the years in the RDD
val minYear = rdd.map(_._2).reduce(Math.min)// look for minimum year
val maxYear = rdd.map(_._2).reduce(Math.max)// look for maximum year
val sequenceOfYears = maxYear to minYear by -1 // create sequence of years from 
max to min

// Define functions to build, for each client, a Map of year -> value for year, 
and how those maps will be merged
def createCombiner(obj: (Int, Int)): Map[Int, String] = Map(obj._1 -> 
obj._2.toString)
def mergeValue(accum: Map[Int, String], obj: (Int, Int)) = accum + (obj._1 -> 
obj._2.toString)
def mergeCombiners(accum1: Map[Int, String], accum2: Map[Int, String]) = accum1 
|+| accum2 // I’m lazy so I use Scalaz to merge two maps of year -> value, I 
assume we don’t have two lines with same client and year…

// For each client, check for each year from maxYear to minYear if it exists in 
the computed map. If not input blank.
val result = rdd
  .map { case obj => (obj._1, (obj._2, obj._3)) }
  .combineByKey(createCombiner, mergeValue, mergeCombiners)
  .map{ case (name, mapOfYearsToValues) => (Seq(name) ++ 
sequenceOfYears.map(year => mapOfYearsToValues.getOrElse(year, " 
"))).mkString(",")} // here we assume that sequence of all years isn’t too big 
to not fit in memory. If you had to compute for each day, it may break and you 
would definitely need to use a specialized timeseries library…

result.foreach(println)

sc.stop()

Best regards,
Fanilo

From: Adrian Tanase [mailto:atan...@adobe.com]
Sent: Friday, October 30, 2015 11:50
To: Deng Ching-Mallete; Ascot Moss
Cc: User
Subject: Re: Pivot Data in Spark and Scala

Its actually a bit tougher as you’ll first need all the years. Also not sure 
how you would reprsent your “columns” given they are dynamic based on the input 
data.

Depending on your downstream processing, I’d probably try to emulate it with a 
hash map with years as keys instead of the columns.

There is probably a nicer solution using the data frames API but I’m not 
familiar with it.

If you actually need vectors I think this article I saw recently on the data 
bricks blog will highlight some options (look for gather encoder)
https://databricks.com/blog/2015/10/20/audience-modeling-with-spark-ml-pipelines.html

-adrian

From: Deng Ching-Mallete
Date: Friday, October 30, 2015 at 4:35 AM
To: Ascot Moss
Cc: User
Subject: Re: Pivot Data in Spark and Scala

Hi,

You could transform it into a pair RDD then use the combineByKey function.

HTH,
Deng

On Thu, Oct 29, 2015 at 7:29 PM, Ascot Moss 
<ascot.m...@gmail.com<mailto:ascot.m...@gmail.com>> wrote:
Hi,

I have data as follows:

A, 2015, 4
A, 2014, 12
A, 2013, 1
B, 2015, 24
B, 2013 4


I need to convert the data to a new format:
A ,4,12,1
B,   24,,4

Any idea how to make it in Spark Scala?

Thanks







Re: Pivot Data in Spark and Scala

2015-10-30 Thread Ali Tajeldin EDU
You can take a look at the smvPivot function in the SMV library ( 
https://github.com/TresAmigosSD/SMV ).  Should look for method "smvPivot" in 
SmvDFHelper (
http://tresamigossd.github.io/SMV/scaladocs/index.html#org.tresamigos.smv.SmvDFHelper).
  You can also perform the pivot on a group-by-group basis.  See smvPivot and 
smvPivotSum in SmvGroupedDataFunc 
(http://tresamigossd.github.io/SMV/scaladocs/index.html#org.tresamigos.smv.SmvGroupedDataFunc).

Docs from smvPivotSum are copied below.  Note that you don't have to specify 
the baseOutput columns, but if you don't, it will force an additional action on 
the input data frame to build the cross products of all possible values in your 
input pivot columns. 

Perform a normal SmvPivot operation followed by a sum on all the output pivot 
columns.
For example:
df.smvGroupBy("id").smvPivotSum(Seq("month", "product"))("count")("5_14_A", 
"5_14_B", "6_14_A", "6_14_B")
and the following input:
Input
| id  | month | product | count |
| --- | - | --- | - |
| 1   | 5/14  |   A |   100 |
| 1   | 6/14  |   B |   200 |
| 1   | 5/14  |   B |   300 |
will produce the following output:
| id  | count_5_14_A | count_5_14_B | count_6_14_A | count_6_14_B |
| --- |  |  |  |  |
| 1   | 100  | 300  | NULL | 200  |
pivotCols
The sequence of column names whose values will be used as the output pivot 
column names.
valueCols
The columns whose value will be copied to the pivoted output columns.
baseOutput
The expected base output column names (without the value column prefix). The
user is required to supply the list of expected pivot column output names to
avoid an extra action on the input DataFrame just to extract the possible
pivot columns. If an empty sequence is provided, then the base output columns
will be extracted from values in the pivot columns (will cause an action on the
entire DataFrame!)

--
Ali
PS: shoot me an email if you run into any issues using SMV.


On Oct 30, 2015, at 6:33 AM, Andrianasolo Fanilo 
<fanilo.andrianas...@worldline.com> wrote:

> Hey,
>  
> The question is tricky, here is a possible answer by defining years as keys 
> for a hashmap per client and merging those :
>  
> import scalaz._
> import Scalaz._
>  
> val sc = new SparkContext("local[*]", "sandbox")
> 
> // Create RDD of your objects
> val rdd = sc.parallelize(Seq(
>   ("A", 2015, 4),
>   ("A", 2014, 12),
>   ("A", 2013, 1),
>   ("B", 2015, 24),
>   ("B", 2013, 4)
> ))
> 
> // Search for all the years in the RDD
> val minYear = rdd.map(_._2).reduce(Math.min)// look for minimum year
> val maxYear = rdd.map(_._2).reduce(Math.max)// look for maximum year
> val sequenceOfYears = maxYear to minYear by -1 // create sequence of years 
> from max to min
> 
> // Define functions to build, for each client, a Map of year -> value for 
> year, and how those maps will be merged
> def createCombiner(obj: (Int, Int)): Map[Int, String] = Map(obj._1 -> 
> obj._2.toString)
> def mergeValue(accum: Map[Int, String], obj: (Int, Int)) = accum + (obj._1 -> 
> obj._2.toString)
> def mergeCombiners(accum1: Map[Int, String], accum2: Map[Int, String]) = 
> accum1 |+| accum2 // I’m lazy so I use Scalaz to merge two maps of year -> 
> value, I assume we don’t have two lines with same client and year…
> 
> // For each client, check for each year from maxYear to minYear if it exists 
> in the computed map. If not input blank.
> val result = rdd
>   .map { case obj => (obj._1, (obj._2, obj._3)) }
>   .combineByKey(createCombiner, mergeValue, mergeCombiners)
>   .map{ case (name, mapOfYearsToValues) => (Seq(name) ++ 
> sequenceOfYears.map(year => mapOfYearsToValues.getOrElse(year, " 
> "))).mkString(",")} // here we assume that sequence of all years isn’t too 
> big to not fit in memory. If you had to compute for each day, it may break 
> and you would definitely need to use a specialized timeseries library…
> 
> result.foreach(println)
> 
> sc.stop()
>  
> Best regards,
> Fanilo
>  
> De : Adrian Tanase [mailto:atan...@adobe.com] 
> Envoyé : vendredi 30 octobre 2015 11:50
> À : Deng Ching-Mallete; Ascot Moss
> Cc : User
> Objet : Re: Pivot Data in Spark and Scala
>  
> Its actually a bit tougher as you’ll first need all the years. Also not sure 
> how you would reprsent your “columns” given they are dynamic based on the 
> input data.
>  
> Depending on your downstream processing, I’d probably try to emulate it with 
> a hash map with years as keys instead of the columns.
>  
> There is proba

Re: Pivot Data in Spark and Scala

2015-10-30 Thread Ruslan Dautkhanov
https://issues.apache.org/jira/browse/SPARK-8992

Should be in 1.6?
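
For reference, a rough sketch of what that could look like once the DataFrame
pivot API lands in 1.6 (untested against 1.6; column names just follow the
sample data in this thread):

val df = sqlContext.createDataFrame(Seq(
  ("A", 2015, 4), ("A", 2014, 12), ("A", 2013, 1),
  ("B", 2015, 24), ("B", 2013, 4)
)).toDF("name", "year", "value")

val pivoted = df.groupBy("name").pivot("year").sum("value")
pivoted.show()
// roughly:
// name | 2013 | 2014 | 2015
// A    |    1 |   12 |    4
// B    |    4 | null |   24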



-- 
Ruslan Dautkhanov

On Thu, Oct 29, 2015 at 5:29 AM, Ascot Moss <ascot.m...@gmail.com> wrote:

> Hi,
>
> I have data as follows:
>
> A, 2015, 4
> A, 2014, 12
> A, 2013, 1
> B, 2015, 24
> B, 2013 4
>
>
> I need to convert the data to a new format:
> A ,4,12,1
> B,   24,    ,4
>
> Any idea how to make it in Spark Scala?
>
> Thanks
>
>


Pivot Data in Spark and Scala

2015-10-29 Thread Ascot Moss
Hi,

I have data as follows:

A, 2015, 4
A, 2014, 12
A, 2013, 1
B, 2015, 24
B, 2013 4


I need to convert the data to a new format:
A ,4,12,1
B,   24,,4

Any idea how to make it in Spark Scala?

Thanks


Re: Pivot Data in Spark and Scala

2015-10-29 Thread Deng Ching-Mallete
Hi,

You could transform it into a pair RDD then use the combineByKey function.

HTH,
Deng

On Thu, Oct 29, 2015 at 7:29 PM, Ascot Moss <ascot.m...@gmail.com> wrote:

> Hi,
>
> I have data as follows:
>
> A, 2015, 4
> A, 2014, 12
> A, 2013, 1
> B, 2015, 24
> B, 2013 4
>
>
> I need to convert the data to a new format:
> A ,4,12,1
> B,   24,    ,4
>
> Any idea how to make it in Spark Scala?
>
> Thanks
>
>


Re: spark and scala-2.11

2015-08-24 Thread Lanny Ripple
We're going to be upgrading from spark 1.0.2 and using hadoop-1.2.1 so need
to build by hand.  (Yes, I know. Use hadoop-2.x but standard resource
constraints apply.)  I want to build against scala-2.11 and publish to our
artifact repository but finding build/spark-2.10.4 and tracing down what
build/mvn was doing had me concerned that I was missing something.  I'll
hold the course and build it as instructed.

Thanks for the info, all.

PS - Since asked -- PATH=./build/apache-maven-3.2.5/bin:$PATH; build/mvn
-Phadoop-1 -Dhadoop.version=1.2.1 -Dscala-2.11 -DskipTests package

On Mon, Aug 24, 2015 at 2:49 PM, Jonathan Coveney jcove...@gmail.com
wrote:

 I've used the instructions and it worked fine.

 Can you post exactly what you're doing, and what it fails with? Or are you
 just trying to understand how it works?

 2015-08-24 15:48 GMT-04:00 Lanny Ripple la...@spotright.com:

 Hello,

 The instructions for building spark against scala-2.11 indicate using
 -Dspark-2.11.  When I look in the pom.xml I find a profile named
 'spark-2.11' but nothing that would indicate I should set a property.  The
 sbt build seems to need the -Dscala-2.11 property set.  Finally build/mvn
 does a simple grep of scala.version (which doesn't change after running dev/
 change-version-to-2.11.sh) so the build seems to be grabbing the 2.10.4
 scala library.

 Anyone know (from having done it and used it in production) if the build
 instructions for spark-1.4.1 against Scala-2.11 are correct?

 Thanks.
   -Lanny





Re: spark and scala-2.11

2015-08-24 Thread Jonathan Coveney
I've used the instructions and it worked fine.

Can you post exactly what you're doing, and what it fails with? Or are you
just trying to understand how it works?

2015-08-24 15:48 GMT-04:00 Lanny Ripple la...@spotright.com:

 Hello,

 The instructions for building spark against scala-2.11 indicate using
 -Dspark-2.11.  When I look in the pom.xml I find a profile named
 'spark-2.11' but nothing that would indicate I should set a property.  The
 sbt build seems to need the -Dscala-2.11 property set.  Finally build/mvn
 does a simple grep of scala.version (which doesn't change after running dev/
 change-version-to-2.11.sh) so the build seems to be grabbing the 2.10.4
 scala library.

 Anyone know (from having done it and used it in production) if the build
 instructions for spark-1.4.1 against Scala-2.11 are correct?

 Thanks.
   -Lanny



Re: spark and scala-2.11

2015-08-24 Thread Sean Owen
The property scala-2.11 triggers the profile scala-2.11 -- and
additionally disables the scala-2.10 profile, so that's the way to do
it. But yes, you also need to run the script before-hand to set up the
build for Scala 2.11 as well.

On Mon, Aug 24, 2015 at 8:48 PM, Lanny Ripple la...@spotright.com wrote:
 Hello,

 The instructions for building spark against scala-2.11 indicate using
 -Dspark-2.11.  When I look in the pom.xml I find a profile named
 'spark-2.11' but nothing that would indicate I should set a property.  The
 sbt build seems to need the -Dscala-2.11 property set.  Finally build/mvn
 does a simple grep of scala.version (which doesn't change after running
 dev/change-version-to-2.11.sh) so the build seems to be grabbing the 2.10.4
 scala library.

 Anyone know (from having done it and used it in production) if the build
 instructions for spark-1.4.1 against Scala-2.11 are correct?

 Thanks.
   -Lanny

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark and scala-2.11

2015-08-24 Thread Lanny Ripple
Hello,

The instructions for building spark against scala-2.11 indicate using
-Dspark-2.11.  When I look in the pom.xml I find a profile named
'spark-2.11' but nothing that would indicate I should set a property.  The
sbt build seems to need the -Dscala-2.11 property set.  Finally build/mvn
does a simple grep of scala.version (which doesn't change after running dev/
change-version-to-2.11.sh) so the build seems to be grabbing the 2.10.4
scala library.

Anyone know (from having done it and used it in production) if the build
instructions for spark-1.4.1 against Scala-2.11 are correct?

Thanks.
  -Lanny


Re: Spark on scala 2.11 build fails due to incorrect jline dependency in REPL

2015-08-17 Thread Ted Yu
You were building against 1.4.x, right ?

In the master branch, switch-to-scala-2.11.sh is gone. There is a scala-2.11
profile.

FYI

On Sun, Aug 16, 2015 at 11:12 AM, Stephen Boesch java...@gmail.com wrote:


 I am building spark with the following options - most notably the
 **scala-2.11**:

  . dev/switch-to-scala-2.11.sh
 mvn -Phive -Pyarn -Phadoop-2.6 -Dhadoop2.6.2 -Pscala-2.11 -DskipTests
 -Dmaven.javadoc.skip=true clean package


 The build goes pretty far but fails in one of the minor modules *repl*:

 [INFO]
 
 [ERROR] Failed to execute goal on project spark-repl_2.11: Could not
 resolve dependencies
 for project org.apache.spark:spark-repl_2.11:jar:1.5.0-SNAPSHOT:
  Could not   find artifact org.scala-lang:jline:jar:2.11.7 in central
  (https://repo1.maven.org/maven2) - [Help 1]

 Upon investigation - from 2.11.5 and later the scala version of jline is
 no longer required: they use the default jline distribution.

 And in fact the repl only shows dependency on jline for the 2.10.4 scala
 version:

 profile
   idscala-2.10/id
   activation
 propertyname!scala-2.11/name/property
   /activation
   properties
 scala.version2.10.4/scala.version
 scala.binary.version2.10/scala.binary.version
 jline.version${scala.version}/jline.version
 jline.groupidorg.scala-lang/jline.groupid
   /properties
   dependencyManagement
 dependencies
   dependency
 groupId${jline.groupid}/groupId
 artifactIdjline/artifactId
 version${jline.version}/version
   /dependency
 /dependencies
   /dependencyManagement
 /profile

 So then it is not clear why this error is occurring. Pointers appreciated.





Re: Spark on scala 2.11 build fails due to incorrect jline dependency in REPL

2015-08-17 Thread Stephen Boesch
In 1.4 it is change-scala-version.sh 2.11

But the problem was that it is a -Dscala-2.11, not a -P. I misread the docs.

2015-08-17 14:17 GMT-07:00 Ted Yu yuzhih...@gmail.com:

 You were building against 1.4.x, right ?

 In master branch, switch-to-scala-2.11.sh is gone. There is scala-2.11
 profile.

 FYI

 On Sun, Aug 16, 2015 at 11:12 AM, Stephen Boesch java...@gmail.com
 wrote:


 I am building spark with the following options - most notably the
 **scala-2.11**:

  . dev/switch-to-scala-2.11.sh
 mvn -Phive -Pyarn -Phadoop-2.6 -Dhadoop2.6.2 -Pscala-2.11 -DskipTests
 -Dmaven.javadoc.skip=true clean package


 The build goes pretty far but fails in one of the minor modules *repl*:

 [INFO]
 
 [ERROR] Failed to execute goal on project spark-repl_2.11: Could not
 resolve dependencies
 for project org.apache.spark:spark-repl_2.11:jar:1.5.0-SNAPSHOT:
  Could not   find artifact org.scala-lang:jline:jar:2.11.7 in central
  (https://repo1.maven.org/maven2) - [Help 1]

 Upon investigation - from 2.11.5 and later the scala version of jline is
 no longer required: they use the default jline distribution.

 And in fact the repl only shows dependency on jline for the 2.10.4 scala
 version:

 profile
   idscala-2.10/id
   activation
 propertyname!scala-2.11/name/property
   /activation
   properties
 scala.version2.10.4/scala.version
 scala.binary.version2.10/scala.binary.version
 jline.version${scala.version}/jline.version
 jline.groupidorg.scala-lang/jline.groupid
   /properties
   dependencyManagement
 dependencies
   dependency
 groupId${jline.groupid}/groupId
 artifactIdjline/artifactId
 version${jline.version}/version
   /dependency
 /dependencies
   /dependencyManagement
 /profile

 So then it is not clear why this error is occurring. Pointers appreciated.






Spark on scala 2.11 build fails due to incorrect jline dependency in REPL

2015-08-16 Thread Stephen Boesch
I am building spark with the following options - most notably the
**scala-2.11**:

 . dev/switch-to-scala-2.11.sh
mvn -Phive -Pyarn -Phadoop-2.6 -Dhadoop2.6.2 -Pscala-2.11 -DskipTests
-Dmaven.javadoc.skip=true clean package


The build goes pretty far but fails in one of the minor modules *repl*:

[INFO]

[ERROR] Failed to execute goal on project spark-repl_2.11: Could not
resolve dependencies
for project org.apache.spark:spark-repl_2.11:jar:1.5.0-SNAPSHOT:
 Could not   find artifact org.scala-lang:jline:jar:2.11.7 in central
 (https://repo1.maven.org/maven2) - [Help 1]

Upon investigation - from 2.11.5 and later the scala version of jline is no
longer required: they use the default jline distribution.

And in fact the repl only shows dependency on jline for the 2.10.4 scala
version:

<profile>
  <id>scala-2.10</id>
  <activation>
    <property><name>!scala-2.11</name></property>
  </activation>
  <properties>
    <scala.version>2.10.4</scala.version>
    <scala.binary.version>2.10</scala.binary.version>
    <jline.version>${scala.version}</jline.version>
    <jline.groupid>org.scala-lang</jline.groupid>
  </properties>
  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>${jline.groupid}</groupId>
        <artifactId>jline</artifactId>
        <version>${jline.version}</version>
      </dependency>
    </dependencies>
  </dependencyManagement>
</profile>

So then it is not clear why this error is occurring. Pointers appreciated.


Re: Finding moving average using Spark and Scala

2015-07-14 Thread Feynman Liang
() on it.

 Anupam Bagchi


 On Jul 13, 2015, at 4:52 PM, Feynman Liang fli...@databricks.com wrote:

 A good example is RegressionMetrics
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala#L48's
 use of of OnlineMultivariateSummarizer to aggregate statistics across
 labels and residuals; take a look at how aggregateByKey is used there.

 On Mon, Jul 13, 2015 at 4:50 PM, Anupam Bagchi 
 anupam_bag...@rocketmail.com wrote:

 Thank you Feynman for your response. Since I am very new to Scala I may
 need a bit more hand-holding at this stage.

 I have been able to incorporate your suggestion about sorting - and it
 now works perfectly. Thanks again for that.

 I tried to use your suggestion of using MultiVariateOnlineSummarizer,
 but could not proceed further. For each deviceid (the key) my goal is to
 get a vector of doubles on which I can query the mean and standard
 deviation. Now because RDDs are immutable, I cannot use a foreach loop to
 interate through the groupby results and individually add the values in an
 RDD - Spark does not allow that. I need to apply the RDD functions directly
 on the entire set to achieve the transformations I need. This is where I am
 faltering since I am not used to the lambda expressions that Scala uses.

 object DeviceAnalyzer {
   def main(args: Array[String]) {
 val sparkConf = new SparkConf().setAppName(Device Analyzer)
 val sc = new SparkContext(sparkConf)

 val logFile = args(0)

 val deviceAggregateLogs = 
 sc.textFile(logFile).map(DailyDeviceAggregates.parseLogLine).cache()

 // Calculate statistics based on bytes
 val deviceIdsMap = deviceAggregateLogs.groupBy(_.device_id)

 // Question: Can we not write the line above as 
 deviceAggregateLogs.groupBy(_.device_id).sortBy(c = c_.2, true) // 
 Anything wrong?

 // All I need to do below is collect the vector of bytes for each 
 device and store it in the RDD

 // The problem with the ‘foreach' approach below, is that it generates 
 the vector values one at a time, which I cannot

 // add individually to an immutable RDD

 deviceIdsMap.foreach(a = {
   val device_id = a._1  // This is the device ID
   val allaggregates = a._2  // This is an array of all 
 device-aggregates for this device

   val sortedaggregates = allaggregates.toArray  
 Sorting.quickSort(sortedaggregates)

   val byteValues = sortedaggregates.map(dda = 
 dda.bytes.toDouble).toArray
   val count = byteValues.count(A = true)
   val sum = byteValues.sum
   val xbar = sum / count
   val sum_x_minus_x_bar_square = byteValues.map(x = 
 (x-xbar)*(x-xbar)).sum
   val stddev = math.sqrt(sum_x_minus_x_bar_square / count)

   val vector: Vector = Vectors.dense(byteValues)
   println(vector)
   println(device_id + , + xbar + , + stddev)
 })

   //val vector: Vector = Vectors.dense(byteValues)
   //println(vector)
   //val summary: MultivariateStatisticalSummary = 
 Statistics.colStats(vector)


 sc.stop() } }

 Can you show me how to write the ‘foreach’ loop in a Spark-friendly way?
 Thanks a lot for your help.

 Anupam Bagchi


 On Jul 13, 2015, at 12:21 PM, Feynman Liang fli...@databricks.com
 wrote:

 The call to Sorting.quicksort is not working. Perhaps I am calling it
 the wrong way.

 allaggregates.toArray allocates and creates a new array separate from
 allaggregates which is sorted by Sorting.quickSort; allaggregates. Try:
 val sortedAggregates = allaggregates.toArray
 Sorting.quickSort(sortedAggregates)

 I would like to use the Spark mllib class
 MultivariateStatisticalSummary to calculate the statistical values.

 MultivariateStatisticalSummary is a trait (similar to a Java interface);
 you probably want to use MultivariateOnlineSummarizer.

 For that I would need to keep all my intermediate values as RDD so that
 I can directly use the RDD methods to do the job.

 Correct; you would do an aggregate using the add and merge functions
 provided by MultivariateOnlineSummarizer

 At the end I also need to write the results to HDFS for which there is
 a method provided on the RDD class to do so, which is another reason I
 would like to retain everything as RDD.

 You can write the RDD[(device_id, MultivariateOnlineSummarizer)] to
 HDFS, or you could unpack the relevant statistics from
 MultivariateOnlineSummarizer into an array/tuple using a mapValues first
 and then write.

 On Mon, Jul 13, 2015 at 10:07 AM, Anupam Bagchi 
 anupam_bag...@rocketmail.com wrote:

 I have to do the following tasks on a dataset using Apache Spark with
 Scala as the programming language:

1. Read the dataset from HDFS. A few sample lines look like this:

  
 deviceid,bytes,eventdate
 15590657,246620,20150630
 14066921,1907,20150621
 14066921,1906,20150626
 6522013,2349,20150626
 6522013,2525,20150613


1. Group the data by device id. Thus we now have a map of deviceid
=> (bytes, eventdate)
2. For each

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Anupam Bagchi
Thank you Feynman for the lead.

I was able to modify the code using clues from the RegressionMetrics example. 
Here is what I got now.

val deviceAggregateLogs = 
sc.textFile(logFile).map(DailyDeviceAggregates.parseLogLine).cache()

// Calculate statistics based on bytes-transferred
val deviceIdsMap = deviceAggregateLogs.groupBy(_.device_id)
println(deviceIdsMap.collect().deep.mkString("\n"))

val summary: MultivariateStatisticalSummary = {
  val summary: MultivariateStatisticalSummary = deviceIdsMap.map {
    case (deviceId, allaggregates) => Vectors.dense({
      val sortedAggregates = allaggregates.toArray
      Sorting.quickSort(sortedAggregates)
      sortedAggregates.map(dda => dda.bytes.toDouble)
    })
  }.aggregate(new MultivariateOnlineSummarizer())(
    (summary, v) => summary.add(v),  // Not sure if this is really what I want, it just came from the example
    (sum1, sum2) => sum1.merge(sum2) // Same doubt here as well
  )
  summary
}
It compiles fine. But I am now getting an exception as follows at Runtime.

Exception in thread main org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 1 in stage 3.0 failed 1 times, most recent failure: Lost 
task 1.0 in stage 3.0 (TID 5, localhost): java.lang.IllegalArgumentException: 
requirement failed: Dimensions mismatch when adding new sample. Expecting 8 but 
got 14.
at scala.Predef$.require(Predef.scala:233)
at 
org.apache.spark.mllib.stat.MultivariateOnlineSummarizer.add(MultivariateOnlineSummarizer.scala:70)
at 
com.aeris.analytics.machinelearning.statistics.DailyDeviceStatisticsAnalyzer$$anonfun$4.apply(DailyDeviceStatisticsAnalyzer.scala:41)
at 
com.aeris.analytics.machinelearning.statistics.DailyDeviceStatisticsAnalyzer$$anonfun$4.apply(DailyDeviceStatisticsAnalyzer.scala:41)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$26.apply(RDD.scala:966)
at org.apache.spark.rdd.RDD$$anonfun$26.apply(RDD.scala:966)
at 
org.apache.spark.SparkContext$$anonfun$32.apply(SparkContext.scala:1533)
at 
org.apache.spark.SparkContext$$anonfun$32.apply(SparkContext.scala:1533)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)

Can’t tell where exactly I went wrong. Also, how do I take the 
MultivariateOnlineSummary object and write it to HDFS? I have the 
MultivariateOnlineSummary object with me, but I really need an RDD to call 
saveAsTextFile() on it.

Anupam Bagchi
(c) 408.431.0780 (h) 408-873-7909

 On Jul 13, 2015, at 4:52 PM, Feynman Liang fli...@databricks.com wrote:
 
 A good example is RegressionMetrics 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala#L48's
 use of MultivariateOnlineSummarizer to aggregate statistics across labels 
 and residuals; take a look at how aggregateByKey is used there.
 
 On Mon, Jul 13, 2015 at 4:50 PM, Anupam Bagchi anupam_bag...@rocketmail.com 
 mailto:anupam_bag...@rocketmail.com wrote:
 Thank you Feynman for your response. Since I am very new to Scala I may need 
 a bit more hand-holding at this stage.
 
 I have been able to incorporate your suggestion about sorting - and it now 
 works perfectly. Thanks again for that.
 
 I tried to use your suggestion of using MultiVariateOnlineSummarizer, but 
 could not proceed further. For each deviceid (the key) my goal is to get a 
 vector of doubles on which I can query the mean and standard deviation. Now 
 because RDDs are immutable, I cannot use a foreach loop to interate through 
 the groupby results and individually add the values in an RDD - Spark does 
 not allow that. I need to apply the RDD functions directly on the entire set 
 to achieve the transformations I need. This is where I am faltering since I 
 am not used to the lambda expressions that Scala uses.
 
 object DeviceAnalyzer {
   def main(args: Array[String]) {
 val sparkConf = new SparkConf().setAppName(Device

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Feynman Liang
Dimensions mismatch when adding new sample. Expecting 8 but got 14.

Make sure all the vectors you are summarizing over have the same dimension.

Why would you want to write a MultivariateOnlineSummary object (which can
be represented with a couple Double's) into a distributed filesystem like
HDFS?
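
A minimal sketch (not code from the thread) of one way to avoid the mismatch, assuming the goal is one summary per device rather than one global summary: keep one summarizer per device and feed each daily byte count as a 1-dimensional vector, so every sample added to a given summarizer has the same length. It reuses the poster's deviceAggregateLogs RDD; the result name is illustrative.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

// One summarizer per device: every sample is a 1-dimensional vector,
// so the "dimensions mismatch" assertion can no longer fire.
val perDeviceSummaries = deviceAggregateLogs
  .map(dda => (dda.device_id, dda.bytes.toDouble))       // (deviceId, bytes)
  .aggregateByKey(new MultivariateOnlineSummarizer())(
    (summ, bytes) => summ.add(Vectors.dense(bytes)),      // fold one sample in
    (s1, s2)      => s1.merge(s2)                         // merge partial summaries
  )

perDeviceSummaries is then an RDD[(Integer, MultivariateOnlineSummarizer)], one entry per device, which sidesteps the question of devices having different numbers of active days.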

On Mon, Jul 13, 2015 at 6:54 PM, Anupam Bagchi anupam_bag...@rocketmail.com
 wrote:

 Thank you Feynman for the lead.

 I was able to modify the code using clues from the RegressionMetrics
 example. Here is what I got now.

 val deviceAggregateLogs = 
 sc.textFile(logFile).map(DailyDeviceAggregates.parseLogLine).cache()

 // Calculate statistics based on bytes-transferred
 val deviceIdsMap = deviceAggregateLogs.groupBy(_.device_id)
 println(deviceIdsMap.collect().deep.mkString(\n))

 val summary: MultivariateStatisticalSummary = {
   val summary: MultivariateStatisticalSummary = deviceIdsMap.map {
 case (deviceId, allaggregates) = Vectors.dense({
   val sortedAggregates = allaggregates.toArray
   Sorting.quickSort(sortedAggregates)
   sortedAggregates.map(dda = dda.bytes.toDouble)
 })
   }.aggregate(new MultivariateOnlineSummarizer())(
   (summary, v) = summary.add(v),  // Not sure if this is really what I 
 want, it just came from the example
   (sum1, sum2) = sum1.merge(sum2) // Same doubt here as well
 )
   summary
 }

 It compiles fine. But I am now getting an exception as follows at Runtime.

 Exception in thread main org.apache.spark.SparkException: Job aborted
 due to stage failure: Task 1 in stage 3.0 failed 1 times, most recent
 failure: Lost task 1.0 in stage 3.0 (TID 5, localhost):
 java.lang.IllegalArgumentException: requirement failed: Dimensions mismatch
 when adding new sample. Expecting 8 but got 14.
 at scala.Predef$.require(Predef.scala:233)
 at
 org.apache.spark.mllib.stat.MultivariateOnlineSummarizer.add(MultivariateOnlineSummarizer.scala:70)
 at
 com.aeris.analytics.machinelearning.statistics.DailyDeviceStatisticsAnalyzer$$anonfun$4.apply(DailyDeviceStatisticsAnalyzer.scala:41)
 at
 com.aeris.analytics.machinelearning.statistics.DailyDeviceStatisticsAnalyzer$$anonfun$4.apply(DailyDeviceStatisticsAnalyzer.scala:41)
 at
 scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
 at
 scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at
 scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
 at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
 at
 scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
 at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
 at org.apache.spark.rdd.RDD$$anonfun$26.apply(RDD.scala:966)
 at org.apache.spark.rdd.RDD$$anonfun$26.apply(RDD.scala:966)
 at
 org.apache.spark.SparkContext$$anonfun$32.apply(SparkContext.scala:1533)
 at
 org.apache.spark.SparkContext$$anonfun$32.apply(SparkContext.scala:1533)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:722)

 Can’t tell where exactly I went wrong. Also, how do I take the
 MultivariateOnlineSummary object and write it to HDFS? I have the
 MultivariateOnlineSummary object with me, but I really need an RDD to call
 saveAsTextFile() on it.

 Anupam Bagchi
 (c) 408.431.0780 (h) 408-873-7909

 On Jul 13, 2015, at 4:52 PM, Feynman Liang fli...@databricks.com wrote:

 A good example is RegressionMetrics
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala#L48's
 use of of OnlineMultivariateSummarizer to aggregate statistics across
 labels and residuals; take a look at how aggregateByKey is used there.

 On Mon, Jul 13, 2015 at 4:50 PM, Anupam Bagchi 
 anupam_bag...@rocketmail.com wrote:

 Thank you Feynman for your response. Since I am very new to Scala I may
 need a bit more hand-holding at this stage.

 I have been able to incorporate your suggestion about sorting - and it
 now works perfectly. Thanks again for that.

 I tried to use your suggestion of using MultiVariateOnlineSummarizer, but
 could not proceed further. For each deviceid (the key) my goal is to get a
 vector of doubles on which I can query the mean and standard deviation. Now
 because RDDs are immutable, I cannot use a foreach loop to interate through
 the groupby

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Anupam Bagchi
Hello Feynman,

Actually in my case, the vectors I am summarizing over will not have the same 
dimension since many devices will be inactive on some days. This is at best a 
sparse matrix where we take only the active days and attempt to fit a moving 
average over it.

The reason I would like to save it to HDFS is that there are really several 
million (almost a billion) devices for which this data needs to be written. I 
am perhaps writing a very few columns, but the number of rows is pretty large.

Given the above two cases, is using MultivariateOnlineSummarizer not a good 
idea then?

Anupam Bagchi
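
A minimal sketch of an alternative under those constraints (an aside, not something proposed in the thread), assuming only the per-device mean and standard deviation of bytes are needed: Spark's StatCounter can be aggregated per key, and the result stays an RDD, so it can be written out with saveAsTextFile even for a very large number of devices. The output path below is a placeholder.

import org.apache.spark.util.StatCounter

// Per-device mean/stddev without MLlib: StatCounter tracks count, mean and
// variance incrementally and merges across partitions.
val statsPerDevice = deviceAggregateLogs
  .map(dda => (dda.device_id, dda.bytes.toDouble))
  .aggregateByKey(new StatCounter())(
    (stats, bytes) => stats.merge(bytes),   // fold one value in
    (s1, s2)       => s1.merge(s2)          // combine partial StatCounters
  )

// Keep everything as an RDD so it can go straight to HDFS.
val k = 3
statsPerDevice
  .map { case (id, s) => s"$id,${s.mean - k * s.stdev},${s.mean + k * s.stdev}" }
  .saveAsTextFile("hdfs:///path/to/output")   // hypothetical output path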


 On Jul 13, 2015, at 7:06 PM, Feynman Liang fli...@databricks.com wrote:
 
 Dimensions mismatch when adding new sample. Expecting 8 but got 14.
 
 Make sure all the vectors you are summarizing over have the same dimension.
 
 Why would you want to write a MultivariateOnlineSummary object (which can be 
 represented with a couple Double's) into a distributed filesystem like HDFS?
 
 On Mon, Jul 13, 2015 at 6:54 PM, Anupam Bagchi anupam_bag...@rocketmail.com 
 mailto:anupam_bag...@rocketmail.com wrote:
 Thank you Feynman for the lead.
 
 I was able to modify the code using clues from the RegressionMetrics example. 
 Here is what I got now.
 
 val deviceAggregateLogs = 
 sc.textFile(logFile).map(DailyDeviceAggregates.parseLogLine).cache()
 
 // Calculate statistics based on bytes-transferred
 val deviceIdsMap = deviceAggregateLogs.groupBy(_.device_id)
 println(deviceIdsMap.collect().deep.mkString(\n))
 
 val summary: MultivariateStatisticalSummary = {
   val summary: MultivariateStatisticalSummary = deviceIdsMap.map {
 case (deviceId, allaggregates) = Vectors.dense({
   val sortedAggregates = allaggregates.toArray
   Sorting.quickSort(sortedAggregates)
   sortedAggregates.map(dda = dda.bytes.toDouble)
 })
   }.aggregate(new MultivariateOnlineSummarizer())(
   (summary, v) = summary.add(v),  // Not sure if this is really what I 
 want, it just came from the example
   (sum1, sum2) = sum1.merge(sum2) // Same doubt here as well
 )
   summary
 }
 It compiles fine. But I am now getting an exception as follows at Runtime.
 
 Exception in thread main org.apache.spark.SparkException: Job aborted due 
 to stage failure: Task 1 in stage 3.0 failed 1 times, most recent failure: 
 Lost task 1.0 in stage 3.0 (TID 5, localhost): 
 java.lang.IllegalArgumentException: requirement failed: Dimensions mismatch 
 when adding new sample. Expecting 8 but got 14.
 at scala.Predef$.require(Predef.scala:233)
 at 
 org.apache.spark.mllib.stat.MultivariateOnlineSummarizer.add(MultivariateOnlineSummarizer.scala:70)
 at 
 com.aeris.analytics.machinelearning.statistics.DailyDeviceStatisticsAnalyzer$$anonfun$4.apply(DailyDeviceStatisticsAnalyzer.scala:41)
 at 
 com.aeris.analytics.machinelearning.statistics.DailyDeviceStatisticsAnalyzer$$anonfun$4.apply(DailyDeviceStatisticsAnalyzer.scala:41)
 at 
 scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
 at 
 scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
 at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
 at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
 at org.apache.spark.rdd.RDD$$anonfun$26.apply(RDD.scala:966)
 at org.apache.spark.rdd.RDD$$anonfun$26.apply(RDD.scala:966)
 at 
 org.apache.spark.SparkContext$$anonfun$32.apply(SparkContext.scala:1533)
 at 
 org.apache.spark.SparkContext$$anonfun$32.apply(SparkContext.scala:1533)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:722)
 
 Can’t tell where exactly I went wrong. Also, how do I take the 
 MultivariateOnlineSummary object and write it to HDFS? I have the 
 MultivariateOnlineSummary object with me, but I really need an RDD to call 
 saveAsTextFile() on it.
 
 Anupam Bagchi
 
 
 On Jul 13, 2015, at 4:52 PM, Feynman Liang fli...@databricks.com 
 mailto:fli...@databricks.com wrote:
 
 A good example is RegressionMetrics 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala#L48's
  use

Finding moving average using Spark and Scala

2015-07-13 Thread Anupam Bagchi
I have to do the following tasks on a dataset using Apache Spark with Scala as 
the programming language:   
   - Read the dataset from HDFS. A few sample lines look like this:

deviceid,bytes,eventdate
15590657,246620,20150630
14066921,1907,20150621
14066921,1906,20150626
6522013,2349,20150626
6522013,2525,20150613
   
   - Group the data by device id. Thus we now have a map of deviceid => 
(bytes,eventdate)
   - For each device, sort the set by eventdate. We now have an ordered set of 
bytes based on eventdate for each device.
   - Pick the last 30 days of bytes from this ordered set.
   - Find the moving average of bytes for the last date using a time period of 
30.
   - Find the standard deviation of the bytes for the final date using a time 
period of 30.
   - Return two values in the result (mean - k*stddev) and (mean + k*stddev) 
[Assume k = 3]
I am using Apache Spark 1.3.0. The actual dataset is wider, and it has to run 
on a billion rows finally. Here is the data structure for the dataset.

package com.testing

case class DailyDeviceAggregates (
    device_id: Integer,
    bytes: Long,
    eventdate: Integer
   ) extends Ordered[DailyDeviceAggregates] {
  def compare(that: DailyDeviceAggregates): Int = {
    eventdate - that.eventdate
  }
}

object DailyDeviceAggregates {
  def parseLogLine(logline: String): DailyDeviceAggregates = {
    val c = logline.split(",")
    DailyDeviceAggregates(c(0).toInt, c(1).toLong, c(2).toInt)
  }
}

The DeviceAnalyzer class looks like this. I have a very crude implementation 
that does the job, but it is not up to the mark. Sorry, I am very new to 
Scala/Spark, so my questions are quite basic. Here is what I have now:
import com.testing.DailyDeviceAggregates
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

import scala.util.Sorting

object DeviceAnalyzer {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Device Analyzer")
    val sc = new SparkContext(sparkConf)

    val logFile = args(0)

    val deviceAggregateLogs = 
      sc.textFile(logFile).map(DailyDeviceAggregates.parseLogLine).cache()

    // Calculate statistics based on bytes
    val deviceIdsMap = deviceAggregateLogs.groupBy(_.device_id)

    deviceIdsMap.foreach(a => {
      val device_id = a._1      // This is the device ID
      val allaggregates = a._2  // This is an array of all device-aggregates for this device

      println(allaggregates)
      Sorting.quickSort(allaggregates.toArray) // Sort the CompactBuffer of DailyDeviceAggregates based on eventdate
      println(allaggregates) // This does not work - results are not sorted !!

      val byteValues = allaggregates.map(dda => dda.bytes.toDouble).toArray
      val count = byteValues.count(A => true)
      val sum = byteValues.sum
      val xbar = sum / count
      val sum_x_minus_x_bar_square = byteValues.map(x => (x-xbar)*(x-xbar)).sum
      val stddev = math.sqrt(sum_x_minus_x_bar_square / count)

      val vector: Vector = Vectors.dense(byteValues)
      println(vector)
      println(device_id + "," + xbar + "," + stddev)

      //val vector: Vector = Vectors.dense(byteValues)
      //println(vector)
      //val summary: MultivariateStatisticalSummary = Statistics.colStats(vector)
    })

    sc.stop()
  }
}

I would really appreciate it if someone can suggest improvements for the 
following:
   - The call to Sorting.quicksort is not working. Perhaps I am calling it the 
wrong way.
   - I would like to use the Spark mllib class MultivariateStatisticalSummary 
to calculate the statistical values.
   - For that I would need to keep all my intermediate values as RDD so that I 
can directly use the RDD methods to do the job.
   - At the end I also need to write the results to HDFS for which there is a 
method provided on the RDD class to do so, which is another reason I would like 
to retain everything as RDD.

Thanks in advance for your help.
Anupam Bagchi 
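
A rough sketch of steps 2-7 (an aside, not code from the thread), assuming "last 30 days" can be read as the 30 most recent records per device; the names and the output path are illustrative. It sorts each device's records by eventdate via the Ordered trait above, takes the last 30, and emits mean +/- 3*stddev. Note that groupByKey keeps each device's full history in memory, so an aggregateByKey formulation may be preferable at billion-row scale.

// Assumes the poster's deviceAggregateLogs: RDD[DailyDeviceAggregates]
val k = 3

val bands = deviceAggregateLogs
  .map(dda => (dda.device_id, dda))
  .groupByKey()
  .mapValues { records =>
    val last30 = records.toSeq.sorted.takeRight(30).map(_.bytes.toDouble)  // most recent 30 entries
    val mean   = last30.sum / last30.size
    val stddev = math.sqrt(last30.map(x => (x - mean) * (x - mean)).sum / last30.size)
    (mean - k * stddev, mean + k * stddev)
  }

bands.map { case (id, (lo, hi)) => s"$id,$lo,$hi" }
     .saveAsTextFile("hdfs:///path/to/bands")   // hypothetical output path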

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Feynman Liang

 The call to Sorting.quicksort is not working. Perhaps I am calling it the
 wrong way.

allaggregates.toArray allocates and creates a new array separate from
allaggregates; that new array is what Sorting.quickSort sorts, not
allaggregates. Try:
val sortedAggregates = allaggregates.toArray
Sorting.quickSort(sortedAggregates)

 I would like to use the Spark mllib class MultivariateStatisticalSummary
 to calculate the statistical values.

MultivariateStatisticalSummary is a trait (similar to a Java interface);
you probably want to use MultivariateOnlineSummarizer.

 For that I would need to keep all my intermediate values as RDD so that I
 can directly use the RDD methods to do the job.

Correct; you would do an aggregate using the add and merge functions
provided by MultivariateOnlineSummarizer

 At the end I also need to write the results to HDFS for which there is a
 method provided on the RDD class to do so, which is another reason I would
 like to retain everything as RDD.

You can write the RDD[(device_id, MultivariateOnlineSummarizer)] to HDFS,
or you could unpack the relevant statistics from
MultivariateOnlineSummarizer into an array/tuple using a mapValues first
and then write.
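
A minimal sketch of that "unpack with mapValues, then write" step, assuming an RDD[(device_id, MultivariateOnlineSummarizer)] with one 1-dimensional summarizer per device already exists; the RDD name and output path below are placeholders.

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

// summariesByDevice: RDD[(Integer, MultivariateOnlineSummarizer)], e.g. built
// per device with aggregateByKey.
def writeStats(summariesByDevice: RDD[(Integer, MultivariateOnlineSummarizer)],
               path: String): Unit = {
  summariesByDevice
    .mapValues(s => (s.mean(0), math.sqrt(s.variance(0))))   // unpack to (mean, stddev)
    .map { case (id, (mean, stddev)) => s"$id,$mean,$stddev" }
    .saveAsTextFile(path)                                     // path is hypothetical
}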

On Mon, Jul 13, 2015 at 10:07 AM, Anupam Bagchi 
anupam_bag...@rocketmail.com wrote:

 I have to do the following tasks on a dataset using Apache Spark with
 Scala as the programming language:

1. Read the dataset from HDFS. A few sample lines look like this:

  
 deviceid,bytes,eventdate
 15590657,246620,20150630
 14066921,1907,20150621
 14066921,1906,20150626
 6522013,2349,20150626
 6522013,2525,20150613


1. Group the data by device id. Thus we now have a map of deviceid =
(bytes,eventdate)
2. For each device, sort the set by eventdate. We now have an ordered
set of bytes based on eventdate for each device.
3. Pick the last 30 days of bytes from this ordered set.
4. Find the moving average of bytes for the last date using a time
period of 30.
5. Find the standard deviation of the bytes for the final date using a
time period of 30.
6. Return two values in the result (mean - k*stddev) and (mean + k*stddev)
[Assume k = 3]

 I am using Apache Spark 1.3.0. The actual dataset is wider, and it has to
 run on a billion rows finally.
 Here is the data structure for the dataset.

 package com.testingcase class DeviceAggregates (
 device_id: Integer,
 bytes: Long,
 eventdate: Integer
) extends Ordered[DailyDeviceAggregates] {
   def compare(that: DailyDeviceAggregates): Int = {
 eventdate - that.eventdate
   }}object DeviceAggregates {
   def parseLogLine(logline: String): DailyDeviceAggregates = {
 val c = logline.split(,)
 DailyDeviceAggregates(c(0).toInt, c(1).toLong, c(2).toInt)
   }}

 The DeviceAnalyzer class looks like this:
 I have a very crude implementation that does the job, but it is not up to
 the mark. Sorry, I am very new to Scala/Spark, so my questions are quite
 basic. Here is what I have now:

 import com.testing.DailyDeviceAggregatesimport 
 org.apache.spark.{SparkContext, SparkConf}import 
 org.apache.spark.mllib.linalg.Vectorimport 
 org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, 
 Statistics}import org.apache.spark.mllib.linalg.{Vector, Vectors}
 import scala.util.Sorting
 object DeviceAnalyzer {
   def main(args: Array[String]) {
 val sparkConf = new SparkConf().setAppName(Device Analyzer)
 val sc = new SparkContext(sparkConf)

 val logFile = args(0)

 val deviceAggregateLogs = 
 sc.textFile(logFile).map(DailyDeviceAggregates.parseLogLine).cache()

 // Calculate statistics based on bytes
 val deviceIdsMap = deviceAggregateLogs.groupBy(_.device_id)

 deviceIdsMap.foreach(a = {
   val device_id = a._1  // This is the device ID
   val allaggregates = a._2  // This is an array of all device-aggregates 
 for this device

   println(allaggregates)
   Sorting.quickSort(allaggregates.toArray) // Sort the CompactBuffer of 
 DailyDeviceAggregates based on eventdate
   println(allaggregates) // This does not work - results are not sorted !!

   val byteValues = allaggregates.map(dda = dda.bytes.toDouble).toArray
   val count = byteValues.count(A = true)
   val sum = byteValues.sum
   val xbar = sum / count
   val sum_x_minus_x_bar_square = byteValues.map(x = 
 (x-xbar)*(x-xbar)).sum
   val stddev = math.sqrt(sum_x_minus_x_bar_square / count)

   val vector: Vector = Vectors.dense(byteValues)
   println(vector)
   println(device_id + , + xbar + , + stddev)

   //val vector: Vector = Vectors.dense(byteValues)
   //println(vector)
   //val summary: MultivariateStatisticalSummary = 
 Statistics.colStats(vector)
 })

 sc.stop()
   }}

 I would really appreciate if someone can suggests improvements for the
 following:

1. The call to Sorting.quicksort is not working. Perhaps I am

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Feynman Liang
A good example is RegressionMetrics
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala#L48's
use of MultivariateOnlineSummarizer to aggregate statistics across
labels and residuals; take a look at how aggregateByKey is used there.

On Mon, Jul 13, 2015 at 4:50 PM, Anupam Bagchi anupam_bag...@rocketmail.com
 wrote:

 Thank you Feynman for your response. Since I am very new to Scala I may
 need a bit more hand-holding at this stage.

 I have been able to incorporate your suggestion about sorting - and it now
 works perfectly. Thanks again for that.

 I tried to use your suggestion of using MultiVariateOnlineSummarizer, but
 could not proceed further. For each deviceid (the key) my goal is to get a
 vector of doubles on which I can query the mean and standard deviation. Now
 because RDDs are immutable, I cannot use a foreach loop to interate through
 the groupby results and individually add the values in an RDD - Spark does
 not allow that. I need to apply the RDD functions directly on the entire
 set to achieve the transformations I need. This is where I am faltering
 since I am not used to the lambda expressions that Scala uses.

 object DeviceAnalyzer {
   def main(args: Array[String]) {
 val sparkConf = new SparkConf().setAppName(Device Analyzer)
 val sc = new SparkContext(sparkConf)

 val logFile = args(0)

 val deviceAggregateLogs = 
 sc.textFile(logFile).map(DailyDeviceAggregates.parseLogLine).cache()

 // Calculate statistics based on bytes
 val deviceIdsMap = deviceAggregateLogs.groupBy(_.device_id)

 // Question: Can we not write the line above as 
 deviceAggregateLogs.groupBy(_.device_id).sortBy(c = c_.2, true) // Anything 
 wrong?

 // All I need to do below is collect the vector of bytes for each device 
 and store it in the RDD

 // The problem with the ‘foreach' approach below, is that it generates 
 the vector values one at a time, which I cannot

 // add individually to an immutable RDD

 deviceIdsMap.foreach(a = {
   val device_id = a._1  // This is the device ID
   val allaggregates = a._2  // This is an array of all device-aggregates 
 for this device

   val sortedaggregates = allaggregates.toArray  
 Sorting.quickSort(sortedaggregates)

   val byteValues = sortedaggregates.map(dda = dda.bytes.toDouble).toArray
   val count = byteValues.count(A = true)
   val sum = byteValues.sum
   val xbar = sum / count
   val sum_x_minus_x_bar_square = byteValues.map(x = 
 (x-xbar)*(x-xbar)).sum
   val stddev = math.sqrt(sum_x_minus_x_bar_square / count)

   val vector: Vector = Vectors.dense(byteValues)
   println(vector)
   println(device_id + , + xbar + , + stddev)
 })

   //val vector: Vector = Vectors.dense(byteValues)
   //println(vector)
   //val summary: MultivariateStatisticalSummary = 
 Statistics.colStats(vector)


 sc.stop() } }

 Can you show me how to write the ‘foreach’ loop in a Spark-friendly way?
 Thanks a lot for your help.

 Anupam Bagchi


 On Jul 13, 2015, at 12:21 PM, Feynman Liang fli...@databricks.com wrote:

 The call to Sorting.quicksort is not working. Perhaps I am calling it the
 wrong way.

 allaggregates.toArray allocates and creates a new array separate from
 allaggregates which is sorted by Sorting.quickSort; allaggregates. Try:
 val sortedAggregates = allaggregates.toArray
 Sorting.quickSort(sortedAggregates)

 I would like to use the Spark mllib class MultivariateStatisticalSummary
 to calculate the statistical values.

 MultivariateStatisticalSummary is a trait (similar to a Java interface);
 you probably want to use MultivariateOnlineSummarizer.

 For that I would need to keep all my intermediate values as RDD so that I
 can directly use the RDD methods to do the job.

 Correct; you would do an aggregate using the add and merge functions
 provided by MultivariateOnlineSummarizer

 At the end I also need to write the results to HDFS for which there is a
 method provided on the RDD class to do so, which is another reason I would
 like to retain everything as RDD.

 You can write the RDD[(device_id, MultivariateOnlineSummarizer)] to HDFS,
 or you could unpack the relevant statistics from
 MultivariateOnlineSummarizer into an array/tuple using a mapValues first
 and then write.

 On Mon, Jul 13, 2015 at 10:07 AM, Anupam Bagchi 
 anupam_bag...@rocketmail.com wrote:

 I have to do the following tasks on a dataset using Apache Spark with
 Scala as the programming language:

1. Read the dataset from HDFS. A few sample lines look like this:

  
 deviceid,bytes,eventdate
 15590657,246620,20150630
 14066921,1907,20150621
 14066921,1906,20150626
 6522013,2349,20150626
 6522013,2525,20150613


1. Group the data by device id. Thus we now have a map of deviceid =
(bytes,eventdate)
2. For each device, sort the set by eventdate. We now have an ordered
set of bytes based on eventdate

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Anupam Bagchi
Thank you Feynman for your response. Since I am very new to Scala I may need a 
bit more hand-holding at this stage.

I have been able to incorporate your suggestion about sorting - and it now 
works perfectly. Thanks again for that.

I tried to use your suggestion of using MultiVariateOnlineSummarizer, but could 
not proceed further. For each deviceid (the key) my goal is to get a vector of 
doubles on which I can query the mean and standard deviation. Now because RDDs 
are immutable, I cannot use a foreach loop to iterate through the groupby 
results and individually add the values in an RDD - Spark does not allow that. 
I need to apply the RDD functions directly on the entire set to achieve the 
transformations I need. This is where I am faltering since I am not used to the 
lambda expressions that Scala uses.

object DeviceAnalyzer {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Device Analyzer")
    val sc = new SparkContext(sparkConf)

    val logFile = args(0)

    val deviceAggregateLogs = 
      sc.textFile(logFile).map(DailyDeviceAggregates.parseLogLine).cache()

    // Calculate statistics based on bytes
    val deviceIdsMap = deviceAggregateLogs.groupBy(_.device_id)
    // Question: Can we not write the line above as
    // deviceAggregateLogs.groupBy(_.device_id).sortBy(c => c._2, true) // Anything wrong?
    // All I need to do below is collect the vector of bytes for each device and store it in the RDD
    // The problem with the 'foreach' approach below, is that it generates the vector values one at a time, which I cannot
    // add individually to an immutable RDD
    deviceIdsMap.foreach(a => {
      val device_id = a._1      // This is the device ID
      val allaggregates = a._2  // This is an array of all device-aggregates for this device

      val sortedaggregates = allaggregates.toArray
      Sorting.quickSort(sortedaggregates)

      val byteValues = sortedaggregates.map(dda => dda.bytes.toDouble).toArray
      val count = byteValues.count(A => true)
      val sum = byteValues.sum
      val xbar = sum / count
      val sum_x_minus_x_bar_square = byteValues.map(x => (x-xbar)*(x-xbar)).sum
      val stddev = math.sqrt(sum_x_minus_x_bar_square / count)

      val vector: Vector = Vectors.dense(byteValues)
      println(vector)
      println(device_id + "," + xbar + "," + stddev)
    })
    //val vector: Vector = Vectors.dense(byteValues)
    //println(vector)
    //val summary: MultivariateStatisticalSummary = Statistics.colStats(vector)

    sc.stop()
  }
}
Can you show me how to write the ‘foreach’ loop in a Spark-friendly way? Thanks 
a lot for your help.

Anupam Bagchi
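
A minimal sketch of that rewrite (an aside, not from the thread), assuming the same imports and the same deviceIdsMap as in the snippet above: mapValues transforms the grouped values with the poster's own arithmetic and keeps the result as an RDD, instead of relying on foreach side effects that only happen on the executors.

import scala.util.Sorting

// deviceIdsMap: RDD[(Integer, Iterable[DailyDeviceAggregates])], as above
val statsByDevice = deviceIdsMap.mapValues { allaggregates =>
  val sorted = allaggregates.toArray
  Sorting.quickSort(sorted)                       // uses the Ordered trait on the case class
  val byteValues = sorted.map(_.bytes.toDouble)
  val count  = byteValues.length
  val xbar   = byteValues.sum / count
  val stddev = math.sqrt(byteValues.map(x => (x - xbar) * (x - xbar)).sum / count)
  (xbar, stddev)
}
// statsByDevice: RDD[(Integer, (Double, Double))] - it can be printed with
// collect(), reused in further transformations, or written with saveAsTextFile.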


 On Jul 13, 2015, at 12:21 PM, Feynman Liang fli...@databricks.com wrote:
 
 The call to Sorting.quicksort is not working. Perhaps I am calling it the 
 wrong way.
 allaggregates.toArray allocates and creates a new array separate from 
 allaggregates which is sorted by Sorting.quickSort; allaggregates. Try:
 val sortedAggregates = allaggregates.toArray
 Sorting.quickSort(sortedAggregates)
 I would like to use the Spark mllib class MultivariateStatisticalSummary to 
 calculate the statistical values.
 MultivariateStatisticalSummary is a trait (similar to a Java interface); you 
 probably want to use MultivariateOnlineSummarizer. 
 For that I would need to keep all my intermediate values as RDD so that I can 
 directly use the RDD methods to do the job.
 Correct; you would do an aggregate using the add and merge functions provided 
 by MultivariateOnlineSummarizer 
 At the end I also need to write the results to HDFS for which there is a 
 method provided on the RDD class to do so, which is another reason I would 
 like to retain everything as RDD.
 You can write the RDD[(device_id, MultivariateOnlineSummarizer)] to HDFS, or 
 you could unpack the relevant statistics from MultivariateOnlineSummarizer 
 into an array/tuple using a mapValues first and then write.
 
 On Mon, Jul 13, 2015 at 10:07 AM, Anupam Bagchi anupam_bag...@rocketmail.com 
 mailto:anupam_bag...@rocketmail.com wrote:
 I have to do the following tasks on a dataset using Apache Spark with Scala 
 as the programming language:
 Read the dataset from HDFS. A few sample lines look like this:
 deviceid,bytes,eventdate
 15590657,246620,20150630
 14066921,1907,20150621
 14066921,1906,20150626
 6522013,2349,20150626
 6522013,2525,20150613
 Group the data by device id. Thus we now have a map of deviceid = 
 (bytes,eventdate)
 For each device, sort the set by eventdate. We now have an ordered set of 
 bytes based on eventdate for each device.
 Pick the last 30 days of bytes from this ordered set.
 Find the moving average of bytes for the last date using a time period of 30.
 Find the standard deviation of the bytes for the final date using a time 
 period of 30.
 Return two values in the result (mean - kstddev) and (mean + kstddev) [Assume 
 k = 3]
 I am using Apache Spark 1.3.0. The actual dataset is wider, and it has to run

Calculating moving average of dataset in Apache Spark and Scala

2015-07-12 Thread Anupam Bagchi
I have to do the following tasks on a dataset using Apache Spark with Scala as 
the programming language:

Read the dataset from HDFS. A few sample lines look like this:
deviceid,bytes,eventdate
15590657,246620,20150630
14066921,1907,20150621
14066921,1906,20150626
6522013,2349,20150626
6522013,2525,20150613
Group the data by device id. Thus we now have a map of deviceid => 
(bytes,eventdate)

For each device, sort the set by eventdate. We now have an ordered set of bytes 
based on eventdate for each device.

Pick the last 30 days of bytes from this ordered set.

Find the moving average of bytes for the last date using a time period of 30.

Find the standard deviation of the bytes for the final date using a time period 
of 30.

Return two values in the result (mean - k * stddev) and (mean + k * stddev) 
[Assume k = 3]

I am using Apache Spark 1.3.0. The actual dataset is wider, and it has to run 
on a billion rows finally.


Here is the data structure for the dataset.

package com.testing

case class DailyDeviceAggregates (
    device_id: Integer,
    bytes: Long,
    eventdate: Integer
   ) extends Ordered[DailyDeviceAggregates] {
  def compare(that: DailyDeviceAggregates): Int = {
    eventdate - that.eventdate
  }
}

object DailyDeviceAggregates {
  def parseLogLine(logline: String): DailyDeviceAggregates = {
    val c = logline.split(",")
    DailyDeviceAggregates(c(0).toInt, c(1).toLong, c(2).toInt)
  }
}

I have a very crude implementation that does the job, but it is not up to the 
mark. Sorry, I am very new to Scala/Spark, so my questions are quite basic. 
Here is what I have now:

import com.testing.DailyDeviceAggregates
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

import scala.util.Sorting

object DeviceAnalyzer {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Device Analyzer")
    val sc = new SparkContext(sparkConf)

    val logFile = args(0)

    val deviceAggregateLogs = 
      sc.textFile(logFile).map(DailyDeviceAggregates.parseLogLine).cache()

    // Calculate statistics based on bytes
    val deviceIdsMap = deviceAggregateLogs.groupBy(_.device_id)

    deviceIdsMap.foreach(a => {
      val device_id = a._1      // This is the device ID
      val allaggregates = a._2  // This is an array of all device-aggregates for this device

      println(allaggregates)
      Sorting.quickSort(allaggregates.toArray) // Sort the CompactBuffer of DailyDeviceAggregates based on eventdate
      println(allaggregates) // This does not work - I can see that values are not sorted

      val byteValues = allaggregates.map(dda => dda.bytes.toDouble).toArray
      val count = byteValues.count(A => true)
      val sum = byteValues.sum
      val xbar = sum / count
      val sum_x_minus_x_bar_square = byteValues.map(x => (x-xbar)*(x-xbar)).sum
      val stddev = math.sqrt(sum_x_minus_x_bar_square / count)

      val vector: Vector = Vectors.dense(byteValues)
      println(vector)
      println(device_id + "," + xbar + "," + stddev)

      //val vector: Vector = Vectors.dense(byteValues)
      //println(vector)
      //val summary: MultivariateStatisticalSummary = Statistics.colStats(vector)  // The way I should be calling it
    })

    sc.stop()
  }
}

I would really appreciate it if someone can suggest improvements for the 
following:

The call to Sorting.quicksort is not working. Perhaps I am calling it the wrong 
way.
I would like to use the Spark mllib class MultivariateStatisticalSummary to 
calculate the statistical values.
For that I would need to keep all my intermediate values as RDD so that I can 
directly use the RDD methods to do the job.
At the end I also need to write the results to HDFS for which there is a method 
provided on the RDD class to do so, which is another reason I would like to 
retain everything as RDD.


Anupam Bagchi




Moving average using Spark and Scala

2015-07-12 Thread Anupam Bagchi
I have to do the following tasks on a dataset using Apache Spark with Scala as 
the programming language:

Read the dataset from HDFS. A few sample lines look like this:
deviceid,bytes,eventdate
15590657,246620,20150630
14066921,1907,20150621
14066921,1906,20150626
6522013,2349,20150626
6522013,2525,20150613
Group the data by device id. Thus we now have a map of deviceid => 
(bytes,eventdate)

For each device, sort the set by eventdate. We now have an ordered set of bytes 
based on eventdate for each device.

Pick the last 30 days of bytes from this ordered set.

Find the moving average of bytes for the last date using a time period of 30.

Find the standard deviation of the bytes for the final date using a time period 
of 30.

Return two values in the result (mean - k*stddev) and (mean + k*stddev) [Assume k 
= 3]

I am using Apache Spark 1.3.0. The actual dataset is wider, and it has to run 
on a billion rows finally.

Here is the data structure for the dataset.

package com.testing

case class DailyDeviceAggregates (
    device_id: Integer,
    bytes: Long,
    eventdate: Integer
   ) extends Ordered[DailyDeviceAggregates] {
  def compare(that: DailyDeviceAggregates): Int = {
    eventdate - that.eventdate
  }
}

object DailyDeviceAggregates {
  def parseLogLine(logline: String): DailyDeviceAggregates = {
    val c = logline.split(",")
    DailyDeviceAggregates(c(0).toInt, c(1).toLong, c(2).toInt)
  }
}
The DeviceAnalyzer class looks like this:

I have a very crude implementation that does the job, but it is not up to the 
mark. Sorry, I am very new to Scala/Spark, so my questions are quite basic. 
Here is what I have now:

import com.testing.DailyDeviceAggregates
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

import scala.util.Sorting

object DeviceAnalyzer {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Device Analyzer")
    val sc = new SparkContext(sparkConf)

    val logFile = args(0)

    val deviceAggregateLogs = 
      sc.textFile(logFile).map(DailyDeviceAggregates.parseLogLine).cache()

    // Calculate statistics based on bytes
    val deviceIdsMap = deviceAggregateLogs.groupBy(_.device_id)

    deviceIdsMap.foreach(a => {
      val device_id = a._1      // This is the device ID
      val allaggregates = a._2  // This is an array of all device-aggregates for this device

      println(allaggregates)
      Sorting.quickSort(allaggregates.toArray) // Sort the CompactBuffer of DailyDeviceAggregates based on eventdate
      println(allaggregates) // This does not work - results are not sorted !!

      val byteValues = allaggregates.map(dda => dda.bytes.toDouble).toArray
      val count = byteValues.count(A => true)
      val sum = byteValues.sum
      val xbar = sum / count
      val sum_x_minus_x_bar_square = byteValues.map(x => (x-xbar)*(x-xbar)).sum
      val stddev = math.sqrt(sum_x_minus_x_bar_square / count)

      val vector: Vector = Vectors.dense(byteValues)
      println(vector)
      println(device_id + "," + xbar + "," + stddev)

      //val vector: Vector = Vectors.dense(byteValues)
      //println(vector)
      //val summary: MultivariateStatisticalSummary = Statistics.colStats(vector)
    })

    sc.stop()
  }
}
I would really appreciate it if someone can suggest improvements for the 
following:

The call to Sorting.quicksort is not working. Perhaps I am calling it the wrong 
way.
I would like to use the Spark mllib class MultivariateStatisticalSummary to 
calculate the statistical values.
For that I would need to keep all my intermediate values as RDD so that I can 
directly use the RDD methods to do the job.
At the end I also need to write the results to HDFS for which there is a method 
provided on the RDD class to do so, which is another reason I would like to 
retain everything as RDD.

Thanks in advance for your help.

Anupam Bagchi



Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Michael Allman
FWIW, this is an essential feature to our use of Spark, and I'm surprised it's 
not advertised clearly as a limitation in the documentation. All I've found 
about running Spark 1.3 on 2.11 is here:

http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211

Also, I'm experiencing some serious stability problems simply trying to run 
the Spark 1.3 Scala 2.11 REPL. Most of the time it fails to load and spews a 
torrent of compiler assertion failures, etc. See attached.

spark@dp-cluster-master-node-001:~/spark/bin$ spark-shell
Spark Command: java -cp 
/opt/spark/conf:/opt/spark/lib/spark-assembly-1.3.2-SNAPSHOT-hadoop2.5.0-cdh5.3.3.jar:/etc/hadoop/conf:/opt/spark/lib/jline-2.12.jar
 -Dscala.usejavacp=true -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit 
--class org.apache.spark.repl.Main spark-shell


Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.1
  /_/
 
Using Scala version 2.11.2 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
Exception in thread main java.lang.AssertionError: assertion failed: parser: 
(source: String, options: Map[String,String])org.apache.spark.sql.DataFrame, 
tailcalls: (source: String, options: 
scala.collection.immutable.Map[String,String])org.apache.spark.sql.DataFrame, 
tailcalls: (source: String, options: 
scala.collection.immutable.Map)org.apache.spark.sql.DataFrame
at scala.reflect.internal.Symbols$TypeHistory.init(Symbols.scala:3601)
at scala.reflect.internal.Symbols$Symbol.rawInfo(Symbols.scala:1521)
at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1439)
at 
scala.tools.nsc.transform.SpecializeTypes$$anonfun$23$$anonfun$apply$20.apply(SpecializeTypes.scala:775)
at 
scala.tools.nsc.transform.SpecializeTypes$$anonfun$23$$anonfun$apply$20.apply(SpecializeTypes.scala:768)
at scala.collection.immutable.List.flatMap(List.scala:327)
at 
scala.tools.nsc.transform.SpecializeTypes$$anonfun$23.apply(SpecializeTypes.scala:768)
at 
scala.tools.nsc.transform.SpecializeTypes$$anonfun$23.apply(SpecializeTypes.scala:766)
at scala.collection.immutable.List.flatMap(List.scala:327)
at 
scala.tools.nsc.transform.SpecializeTypes.specializeClass(SpecializeTypes.scala:766)
at 
scala.tools.nsc.transform.SpecializeTypes.transformInfo(SpecializeTypes.scala:1187)
at 
scala.tools.nsc.transform.InfoTransform$Phase$$anon$1.transform(InfoTransform.scala:38)
at scala.reflect.internal.Symbols$Symbol.rawInfo(Symbols.scala:1519)
at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1439)
at scala.reflect.internal.Symbols$Symbol.isDerivedValueClass(Symbols.scala:775)
at scala.reflect.internal.transform.Erasure$ErasureMap.apply(Erasure.scala:131)
at scala.reflect.internal.transform.Erasure$ErasureMap.apply(Erasure.scala:144)
at 
scala.reflect.internal.transform.Erasure$class.specialErasure(Erasure.scala:209)
at scala.tools.nsc.transform.Erasure.specialErasure(Erasure.scala:15)
at 
scala.reflect.internal.transform.Erasure$class.transformInfo(Erasure.scala:364)
at scala.tools.nsc.transform.Erasure.transformInfo(Erasure.scala:348)
at 
scala.tools.nsc.transform.InfoTransform$Phase$$anon$1.transform(InfoTransform.scala:38)
at scala.reflect.internal.Symbols$Symbol.rawInfo(Symbols.scala:1519)
at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1439)
at 
scala.tools.nsc.transform.Erasure$ErasureTransformer$$anonfun$checkNoDeclaredDoubleDefs$1$$anonfun$apply$mcV$sp$2.apply(Erasure.scala:753)
at 
scala.tools.nsc.transform.Erasure$ErasureTransformer$$anonfun$checkNoDeclaredDoubleDefs$1$$anonfun$apply$mcV$sp$2.apply(Erasure.scala:753)
at scala.reflect.internal.Scopes$Scope.foreach(Scopes.scala:373)
at 
scala.tools.nsc.transform.Erasure$ErasureTransformer$$anonfun$checkNoDeclaredDoubleDefs$1.apply(Erasure.scala:753)
at 
scala.tools.nsc.transform.Erasure$ErasureTransformer$$anonfun$checkNoDeclaredDoubleDefs$1.apply(Erasure.scala:753)
at scala.reflect.internal.SymbolTable.enteringPhase(SymbolTable.scala:235)
at scala.reflect.internal.SymbolTable.exitingPhase(SymbolTable.scala:256)
at 
scala.tools.nsc.transform.Erasure$ErasureTransformer.checkNoDeclaredDoubleDefs(Erasure.scala:753)
at 
scala.tools.nsc.transform.Erasure$ErasureTransformer.scala$tools$nsc$transform$Erasure$ErasureTransformer$$checkNoDoubleDefs(Erasure.scala:780)
at 
scala.tools.nsc.transform.Erasure$ErasureTransformer$$anon$1.preErase(Erasure.scala:1074)
at 
scala.tools.nsc.transform.Erasure$ErasureTransformer$$anon$1.transform(Erasure.scala:1109)
at 
scala.tools.nsc.transform.Erasure$ErasureTransformer$$anon$1.transform(Erasure.scala:841)
at scala.reflect.api.Trees$Transformer.transformTemplate(Trees.scala:2563)
at scala.reflect.internal.Trees$$anonfun$itransform$4.apply(Trees.scala:1401)
at scala.reflect.internal.Trees$$anonfun$itransform$4.apply(Trees.scala:1400

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Sean Owen
Doesn't this reduce to "Scala isn't compatible with itself across
maintenance releases"? Meaning, if this were fixed then Scala
2.11.{x < 6} would have similar failures. It's not "not ready"; it's
just not the Scala 2.11.6 REPL. Still, sure I'd favor breaking the
unofficial support to at least make the latest Scala 2.11 the unbroken
one.

On Fri, Apr 17, 2015 at 7:58 AM, Michael Allman mich...@videoamp.com wrote:
 FWIW, this is an essential feature to our use of Spark, and I'm surprised
 it's not advertised clearly as a limitation in the documentation. All I've
 found about running Spark 1.3 on 2.11 is here:

 http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211

 Also, I'm experiencing some serious stability problems simply trying to run
 the Spark 1.3 Scala 2.11 REPL. Most of the time it fails to load and spews a
 torrent of compiler assertion failures, etc. See attached.



 Unfortunately, it appears the Spark 1.3 Scala 2.11 REPL is simply not ready
 for production use. I was going to file a bug, but it seems clear that the
 current implementation is going to need to be forward-ported to Scala 2.11.6
 anyway. We already have an issue for that:

 https://issues.apache.org/jira/browse/SPARK-6155

 Michael


 On Apr 9, 2015, at 10:29 PM, Prashant Sharma scrapco...@gmail.com wrote:

 You will have to go to this commit ID
 191d7cf2a655d032f160b9fa181730364681d0e7 in Apache spark. [1] Once you are
 at that commit, you need to review the changes done to the repl code and
 look for the relevant occurrences of the same code in scala 2.11 repl source
 and somehow make it all work.


 Thanks,





 1. http://githowto.com/getting_old_versions

 Prashant Sharma



 On Thu, Apr 9, 2015 at 4:40 PM, Alex Nakos ana...@gmail.com wrote:

 Ok, what do i need to do in order to migrate the patch?

 Thanks
 Alex

 On Thu, Apr 9, 2015 at 11:54 AM, Prashant Sharma scrapco...@gmail.com
 wrote:

 This is the jira I referred to
 https://issues.apache.org/jira/browse/SPARK-3256. Another reason for not
 working on it is evaluating priority between upgrading to scala 2.11.5(it is
 non trivial I suppose because repl has changed a bit) or migrating that
 patch is much simpler.

 Prashant Sharma



 On Thu, Apr 9, 2015 at 4:16 PM, Alex Nakos ana...@gmail.com wrote:

 Hi-

 Was this the JIRA issue?
 https://issues.apache.org/jira/browse/SPARK-2988

 Any help in getting this working would be much appreciated!

 Thanks
 Alex

 On Thu, Apr 9, 2015 at 11:32 AM, Prashant Sharma scrapco...@gmail.com
 wrote:

 You are right this needs to be done. I can work on it soon, I was not
 sure if there is any one even using scala 2.11 spark repl. Actually there 
 is
 a patch in scala 2.10 shell to support adding jars (Lost the JIRA ID), 
 which
 has to be ported for scala 2.11 too. If however, you(or anyone else) are
 planning to work, I can help you ?

 Prashant Sharma



 On Thu, Apr 9, 2015 at 3:08 PM, anakos ana...@gmail.com wrote:

 Hi-

 I am having difficulty getting the 1.3.0 Spark shell to find an
 external
 jar.  I have built Spark locally for Scala 2.11 and I am starting the
 REPL
 as follows:

 bin/spark-shell --master yarn --jars data-api-es-data-export-4.0.0.jar

 I see the following line in the console output:

 15/04/09 09:52:15 INFO spark.SparkContext: Added JAR

 file:/opt/spark/spark-1.3.0_2.11-hadoop2.3/data-api-es-data-export-4.0.0.jar
 at http://192.168.115.31:54421/jars/data-api-es-data-export-4.0.0.jar
 with
 timestamp 1428569535904

 but when i try to import anything from this jar, it's simply not
 available.
 When I try to add the jar manually using the command

 :cp /path/to/jar

 the classes in the jar are still unavailable. I understand that 2.11
 is not
 officially supported, but has anyone been able to get an external jar
 loaded
 in the 1.3.0 release?  Is this a known issue? I have tried searching
 around
 for answers but the only thing I've found that may be related is this:

 https://issues.apache.org/jira/browse/SPARK-3257

 Any/all help is much appreciated.
 Thanks
 Alex



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/External-JARs-not-loading-Spark-Shell-Scala-2-11-tp22434.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org









-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Michael Allman
H... I don't follow. The 2.11.x series is supposed to be binary compatible 
against user code. Anyway, I was building Spark against 2.11.2 and still saw 
the problems with the REPL. I've created a bug report:

https://issues.apache.org/jira/browse/SPARK-6989 
https://issues.apache.org/jira/browse/SPARK-6989

I hope this helps.

Cheers,

Michael

 On Apr 17, 2015, at 1:41 AM, Sean Owen so...@cloudera.com wrote:
 
 Doesn't this reduce to Scala isn't compatible with itself across
 maintenance releases? Meaning, if this were fixed then Scala
 2.11.{x  6} would have similar failures. It's not not-ready; it's
 just not the Scala 2.11.6 REPL. Still, sure I'd favor breaking the
 unofficial support to at least make the latest Scala 2.11 the unbroken
 one.
 
 On Fri, Apr 17, 2015 at 7:58 AM, Michael Allman mich...@videoamp.com wrote:
 FWIW, this is an essential feature to our use of Spark, and I'm surprised
 it's not advertised clearly as a limitation in the documentation. All I've
 found about running Spark 1.3 on 2.11 is here:
 
 http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211
 
 Also, I'm experiencing some serious stability problems simply trying to run
 the Spark 1.3 Scala 2.11 REPL. Most of the time it fails to load and spews a
 torrent of compiler assertion failures, etc. See attached.
 
 
 
 Unfortunately, it appears the Spark 1.3 Scala 2.11 REPL is simply not ready
 for production use. I was going to file a bug, but it seems clear that the
 current implementation is going to need to be forward-ported to Scala 2.11.6
 anyway. We already have an issue for that:
 
 https://issues.apache.org/jira/browse/SPARK-6155
 
 Michael
 
 
 On Apr 9, 2015, at 10:29 PM, Prashant Sharma scrapco...@gmail.com wrote:
 
 You will have to go to this commit ID
 191d7cf2a655d032f160b9fa181730364681d0e7 in Apache spark. [1] Once you are
 at that commit, you need to review the changes done to the repl code and
 look for the relevant occurrences of the same code in scala 2.11 repl source
 and somehow make it all work.
 
 
 Thanks,
 
 
 
 
 
 1. http://githowto.com/getting_old_versions
 
 Prashant Sharma
 
 
 
 On Thu, Apr 9, 2015 at 4:40 PM, Alex Nakos ana...@gmail.com wrote:
 
 Ok, what do i need to do in order to migrate the patch?
 
 Thanks
 Alex
 
 On Thu, Apr 9, 2015 at 11:54 AM, Prashant Sharma scrapco...@gmail.com
 wrote:
 
 This is the jira I referred to
 https://issues.apache.org/jira/browse/SPARK-3256. Another reason for not
 working on it is evaluating priority between upgrading to scala 2.11.5(it 
 is
 non trivial I suppose because repl has changed a bit) or migrating that
 patch is much simpler.
 
 Prashant Sharma
 
 
 
 On Thu, Apr 9, 2015 at 4:16 PM, Alex Nakos ana...@gmail.com wrote:
 
 Hi-
 
 Was this the JIRA issue?
 https://issues.apache.org/jira/browse/SPARK-2988
 
 Any help in getting this working would be much appreciated!
 
 Thanks
 Alex
 
 On Thu, Apr 9, 2015 at 11:32 AM, Prashant Sharma scrapco...@gmail.com
 wrote:
 
 You are right this needs to be done. I can work on it soon, I was not
 sure if there is any one even using scala 2.11 spark repl. Actually 
 there is
 a patch in scala 2.10 shell to support adding jars (Lost the JIRA ID), 
 which
 has to be ported for scala 2.11 too. If however, you(or anyone else) are
 planning to work, I can help you ?
 
 Prashant Sharma
 
 
 
 On Thu, Apr 9, 2015 at 3:08 PM, anakos ana...@gmail.com wrote:
 
 Hi-
 
 I am having difficulty getting the 1.3.0 Spark shell to find an
 external
 jar.  I have build Spark locally for Scala 2.11 and I am starting the
 REPL
 as follows:
 
 bin/spark-shell --master yarn --jars data-api-es-data-export-4.0.0.jar
 
 I see the following line in the console output:
 
 15/04/09 09:52:15 INFO spark.SparkContext: Added JAR
 
 file:/opt/spark/spark-1.3.0_2.11-hadoop2.3/data-api-es-data-export-4.0.0.jar
 at http://192.168.115.31:54421/jars/data-api-es-data-export-4.0.0.jar
 with
 timestamp 1428569535904
 
 but when i try to import anything from this jar, it's simply not
 available.
 When I try to add the jar manually using the command
 
 :cp /path/to/jar
 
 the classes in the jar are still unavailable. I understand that 2.11
 is not
 officially supported, but has anyone been able to get an external jar
 loaded
 in the 1.3.0 release?  Is this a known issue? I have tried searching
 around
 for answers but the only thing I've found that may be related is this:
 
 https://issues.apache.org/jira/browse/SPARK-3257
 
 Any/all help is much appreciated.
 Thanks
 Alex
 
 
 
 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/External-JARs-not-loading-Spark-Shell-Scala-2-11-tp22434.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 
 
 
 
 
 
 



Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Sean Owen
You are running on 2.11.6, right? of course, it seems like that should
all work, but it doesn't work for you. My point is that the shell you
are saying doesn't work is Scala's 2.11.2 shell -- with some light
modification.

It's possible that the delta is the problem. I can't entirely make out
whether the errors are Spark-specific; they involve Spark classes in
some cases but they're assertion errors from Scala libraries.

I don't know if this shell is supposed to work even across maintenance
releases as-is, though that would be very nice. It's not an API for
Scala.

A good test of whether this idea has any merit would be to run with
Scala 2.11.2. I'll copy this to JIRA for continuation.
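
One quick check along those lines (an aside, not from the thread): the version of the Scala library actually on the REPL's classpath can be read off from inside spark-shell with standard Scala API, which helps confirm whether the session is really running against 2.11.2 or something newer.

scala> scala.util.Properties.versionNumberString   // reports the scala-library version on the classpath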

On Fri, Apr 17, 2015 at 10:31 PM, Michael Allman mich...@videoamp.com wrote:
 H... I don't follow. The 2.11.x series is supposed to be binary
 compatible against user code. Anyway, I was building Spark against 2.11.2
 and still saw the problems with the REPL. I've created a bug report:

 https://issues.apache.org/jira/browse/SPARK-6989

 I hope this helps.

 Cheers,

 Michael

 On Apr 17, 2015, at 1:41 AM, Sean Owen so...@cloudera.com wrote:

 Doesn't this reduce to Scala isn't compatible with itself across
 maintenance releases? Meaning, if this were fixed then Scala
 2.11.{x  6} would have similar failures. It's not not-ready; it's
 just not the Scala 2.11.6 REPL. Still, sure I'd favor breaking the
 unofficial support to at least make the latest Scala 2.11 the unbroken
 one.

 On Fri, Apr 17, 2015 at 7:58 AM, Michael Allman mich...@videoamp.com
 wrote:

 FWIW, this is an essential feature to our use of Spark, and I'm surprised
 it's not advertised clearly as a limitation in the documentation. All I've
 found about running Spark 1.3 on 2.11 is here:

 http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211

 Also, I'm experiencing some serious stability problems simply trying to run
 the Spark 1.3 Scala 2.11 REPL. Most of the time it fails to load and spews a
 torrent of compiler assertion failures, etc. See attached.



 Unfortunately, it appears the Spark 1.3 Scala 2.11 REPL is simply not ready
 for production use. I was going to file a bug, but it seems clear that the
 current implementation is going to need to be forward-ported to Scala 2.11.6
 anyway. We already have an issue for that:

 https://issues.apache.org/jira/browse/SPARK-6155

 Michael


 On Apr 9, 2015, at 10:29 PM, Prashant Sharma scrapco...@gmail.com wrote:

 You will have to go to this commit ID
 191d7cf2a655d032f160b9fa181730364681d0e7 in Apache spark. [1] Once you are
 at that commit, you need to review the changes done to the repl code and
 look for the relevant occurrences of the same code in scala 2.11 repl source
 and somehow make it all work.


 Thanks,





 1. http://githowto.com/getting_old_versions

 Prashant Sharma



 On Thu, Apr 9, 2015 at 4:40 PM, Alex Nakos ana...@gmail.com wrote:


 Ok, what do i need to do in order to migrate the patch?

 Thanks
 Alex

 On Thu, Apr 9, 2015 at 11:54 AM, Prashant Sharma scrapco...@gmail.com
 wrote:


 This is the jira I referred to
 https://issues.apache.org/jira/browse/SPARK-3256. Another reason for not
 working on it is evaluating priority between upgrading to scala 2.11.5(it is
 non trivial I suppose because repl has changed a bit) or migrating that
 patch is much simpler.

 Prashant Sharma



 On Thu, Apr 9, 2015 at 4:16 PM, Alex Nakos ana...@gmail.com wrote:


 Hi-

 Was this the JIRA issue?
 https://issues.apache.org/jira/browse/SPARK-2988

 Any help in getting this working would be much appreciated!

 Thanks
 Alex

 On Thu, Apr 9, 2015 at 11:32 AM, Prashant Sharma scrapco...@gmail.com
 wrote:


 You are right this needs to be done. I can work on it soon, I was not
 sure if there is any one even using scala 2.11 spark repl. Actually there is
 a patch in scala 2.10 shell to support adding jars (Lost the JIRA ID), which
 has to be ported for scala 2.11 too. If however, you(or anyone else) are
 planning to work, I can help you ?

 Prashant Sharma



 On Thu, Apr 9, 2015 at 3:08 PM, anakos ana...@gmail.com wrote:


 Hi-

 I am having difficulty getting the 1.3.0 Spark shell to find an
 external
 jar.  I have build Spark locally for Scala 2.11 and I am starting the
 REPL
 as follows:

 bin/spark-shell --master yarn --jars data-api-es-data-export-4.0.0.jar

 I see the following line in the console output:

 15/04/09 09:52:15 INFO spark.SparkContext: Added JAR

 file:/opt/spark/spark-1.3.0_2.11-hadoop2.3/data-api-es-data-export-4.0.0.jar
 at http://192.168.115.31:54421/jars/data-api-es-data-export-4.0.0.jar
 with
 timestamp 1428569535904

 but when i try to import anything from this jar, it's simply not
 available.
 When I try to add the jar manually using the command

 :cp /path/to/jar

 the classes in the jar are still unavailable. I understand that 2.11
 is not
 officially supported, but has anyone been able to get an external jar
 loaded
 in the 1.3.0 release

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Michael Allman
I actually just saw your comment on SPARK-6989 before this message. So I'll 
copy to the mailing list:

I'm not sure I understand what you mean about running on 2.11.6. I'm just 
running the spark-shell command. It in turn is running


  java -cp /opt/spark/conf:/opt/spark/lib/spark-assembly-1.3.2-SNAPSHOT-hadoop2.5.0-cdh5.3.3.jar:/etc/hadoop/conf:/opt/spark/lib/jline-2.12.jar \
    -Dscala.usejavacp=true -Xms512m -Xmx512m \
    org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main spark-shell


I built Spark with the included build/mvn script. As far as I can tell, the 
only reference to a specific version of Scala is in the top-level pom file, and 
it says 2.11.2.
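
A quick way to confirm which Scala library a given spark-shell is actually
running against (a small aside, not specific to this build) is to ask the
Scala runtime itself from inside the REPL:

  scala.util.Properties.versionString        // e.g. "version 2.11.2"
  scala.util.Properties.versionNumberString  // e.g. "2.11.2"

Both report the Scala library found on the classpath, which is what matters
for the binary-compatibility discussion in this thread.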

 On Apr 17, 2015, at 9:57 PM, Sean Owen so...@cloudera.com wrote:
 
 You are running on 2.11.6, right? of course, it seems like that should
 all work, but it doesn't work for you. My point is that the shell you
 are saying doesn't work is Scala's 2.11.2 shell -- with some light
 modification.
 
 It's possible that the delta is the problem. I can't entirely make out
 whether the errors are Spark-specific; they involve Spark classes in
 some cases but they're assertion errors from Scala libraries.
 
 I don't know if this shell is supposed to work even across maintenance
 releases as-is, though that would be very nice. It's not an API for
 Scala.
 
 A good test of whether this idea has any merit would be to run with
 Scala 2.11.2. I'll copy this to JIRA for continuation.
 
 On Fri, Apr 17, 2015 at 10:31 PM, Michael Allman mich...@videoamp.com wrote:
 H... I don't follow. The 2.11.x series is supposed to be binary
 compatible against user code. Anyway, I was building Spark against 2.11.2
 and still saw the problems with the REPL. I've created a bug report:
 
 https://issues.apache.org/jira/browse/SPARK-6989
 
 I hope this helps.
 
 Cheers,
 
 Michael
 
 On Apr 17, 2015, at 1:41 AM, Sean Owen so...@cloudera.com wrote:
 
 Doesn't this reduce to Scala isn't compatible with itself across
 maintenance releases? Meaning, if this were fixed then Scala
 2.11.{x  6} would have similar failures. It's not not-ready; it's
 just not the Scala 2.11.6 REPL. Still, sure I'd favor breaking the
 unofficial support to at least make the latest Scala 2.11 the unbroken
 one.
 
 On Fri, Apr 17, 2015 at 7:58 AM, Michael Allman mich...@videoamp.com
 wrote:
 
 FWIW, this is an essential feature to our use of Spark, and I'm surprised
 it's not advertised clearly as a limitation in the documentation. All I've
 found about running Spark 1.3 on 2.11 is here:
 
 http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211
 
 Also, I'm experiencing some serious stability problems simply trying to run
 the Spark 1.3 Scala 2.11 REPL. Most of the time it fails to load and spews a
 torrent of compiler assertion failures, etc. See attached.
 
 
 
 Unfortunately, it appears the Spark 1.3 Scala 2.11 REPL is simply not ready
 for production use. I was going to file a bug, but it seems clear that the
 current implementation is going to need to be forward-ported to Scala 2.11.6
 anyway. We already have an issue for that:
 
 https://issues.apache.org/jira/browse/SPARK-6155
 
 Michael
 
 
 On Apr 9, 2015, at 10:29 PM, Prashant Sharma scrapco...@gmail.com wrote:
 
 You will have to go to this commit ID
 191d7cf2a655d032f160b9fa181730364681d0e7 in Apache spark. [1] Once you are
 at that commit, you need to review the changes done to the repl code and
 look for the relevant occurrences of the same code in scala 2.11 repl source
 and somehow make it all work.
 
 
 Thanks,
 
 
 
 
 
 1. http://githowto.com/getting_old_versions
 
 Prashant Sharma
 
 
 
 On Thu, Apr 9, 2015 at 4:40 PM, Alex Nakos ana...@gmail.com wrote:
 
 
 Ok, what do i need to do in order to migrate the patch?
 
 Thanks
 Alex
 
 On Thu, Apr 9, 2015 at 11:54 AM, Prashant Sharma scrapco...@gmail.com
 wrote:
 
 
 This is the jira I referred to
 https://issues.apache.org/jira/browse/SPARK-3256. Another reason for not
 working on it is evaluating priority between upgrading to scala 2.11.5(it is
 non trivial I suppose because repl has changed a bit) or migrating that
 patch is much simpler.
 
 Prashant Sharma
 
 
 
 On Thu, Apr 9, 2015 at 4:16 PM, Alex Nakos ana...@gmail.com wrote:
 
 
 Hi-
 
 Was this the JIRA issue?
 https://issues.apache.org/jira/browse/SPARK-2988
 
 Any help in getting this working would be much appreciated!
 
 Thanks
 Alex
 
 On Thu, Apr 9, 2015 at 11:32 AM, Prashant Sharma scrapco...@gmail.com
 wrote:
 
 
 You are right this needs to be done. I can work on it soon, I was not
 sure if there is any one even using scala 2.11 spark repl. Actually there is
 a patch in scala 2.10 shell to support adding jars (Lost the JIRA ID), which
 has to be ported for scala 2.11 too. If however, you(or anyone else) are
 planning to work, I can help you ?
 
 Prashant Sharma
 
 
 
 On Thu, Apr 9, 2015 at 3:08 PM, anakos ana...@gmail.com wrote:
 
 
 Hi-
 
 I am having difficulty getting

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-09 Thread Alex Nakos
Ok, what do i need to do in order to migrate the patch?

Thanks
Alex

On Thu, Apr 9, 2015 at 11:54 AM, Prashant Sharma scrapco...@gmail.com
wrote:

 This is the jira I referred to
 https://issues.apache.org/jira/browse/SPARK-3256. Another reason for not
 working on it is evaluating priority between upgrading to scala 2.11.5(it
 is non trivial I suppose because repl has changed a bit) or migrating that
 patch is much simpler.

 Prashant Sharma



 On Thu, Apr 9, 2015 at 4:16 PM, Alex Nakos ana...@gmail.com wrote:

 Hi-

 Was this the JIRA issue? https://issues.apache.org/jira/browse/SPARK-2988

 Any help in getting this working would be much appreciated!

 Thanks
 Alex

 On Thu, Apr 9, 2015 at 11:32 AM, Prashant Sharma scrapco...@gmail.com
 wrote:

 You are right this needs to be done. I can work on it soon, I was not
 sure if there is any one even using scala 2.11 spark repl. Actually there
 is a patch in scala 2.10 shell to support adding jars (Lost the JIRA ID),
 which has to be ported for scala 2.11 too. If however, you(or anyone else)
 are planning to work, I can help you ?

 Prashant Sharma



 On Thu, Apr 9, 2015 at 3:08 PM, anakos ana...@gmail.com wrote:

 Hi-

 I am having difficulty getting the 1.3.0 Spark shell to find an external
 jar.  I have build Spark locally for Scala 2.11 and I am starting the
 REPL
 as follows:

 bin/spark-shell --master yarn --jars data-api-es-data-export-4.0.0.jar

 I see the following line in the console output:

 15/04/09 09:52:15 INFO spark.SparkContext: Added JAR

 file:/opt/spark/spark-1.3.0_2.11-hadoop2.3/data-api-es-data-export-4.0.0.jar
 at http://192.168.115.31:54421/jars/data-api-es-data-export-4.0.0.jar
 with
 timestamp 1428569535904

 but when i try to import anything from this jar, it's simply not
 available.
 When I try to add the jar manually using the command

 :cp /path/to/jar

 the classes in the jar are still unavailable. I understand that 2.11 is
 not
 officially supported, but has anyone been able to get an external jar
 loaded
 in the 1.3.0 release?  Is this a known issue? I have tried searching
 around
 for answers but the only thing I've found that may be related is this:

 https://issues.apache.org/jira/browse/SPARK-3257

 Any/all help is much appreciated.
 Thanks
 Alex



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/External-JARs-not-loading-Spark-Shell-Scala-2-11-tp22434.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







Re: External JARs not loading Spark Shell Scala 2.11

2015-04-09 Thread Alex Nakos
Hi-

Was this the JIRA issue? https://issues.apache.org/jira/browse/SPARK-2988

Any help in getting this working would be much appreciated!

Thanks
Alex

On Thu, Apr 9, 2015 at 11:32 AM, Prashant Sharma scrapco...@gmail.com
wrote:

 You are right this needs to be done. I can work on it soon, I was not sure
 if there is any one even using scala 2.11 spark repl. Actually there is a
 patch in scala 2.10 shell to support adding jars (Lost the JIRA ID), which
 has to be ported for scala 2.11 too. If however, you(or anyone else) are
 planning to work, I can help you ?

 Prashant Sharma



 On Thu, Apr 9, 2015 at 3:08 PM, anakos ana...@gmail.com wrote:

 Hi-

 I am having difficulty getting the 1.3.0 Spark shell to find an external
 jar.  I have build Spark locally for Scala 2.11 and I am starting the REPL
 as follows:

 bin/spark-shell --master yarn --jars data-api-es-data-export-4.0.0.jar

 I see the following line in the console output:

 15/04/09 09:52:15 INFO spark.SparkContext: Added JAR

 file:/opt/spark/spark-1.3.0_2.11-hadoop2.3/data-api-es-data-export-4.0.0.jar
 at http://192.168.115.31:54421/jars/data-api-es-data-export-4.0.0.jar
 with
 timestamp 1428569535904

 but when i try to import anything from this jar, it's simply not
 available.
 When I try to add the jar manually using the command

 :cp /path/to/jar

 the classes in the jar are still unavailable. I understand that 2.11 is
 not
 officially supported, but has anyone been able to get an external jar
 loaded
 in the 1.3.0 release?  Is this a known issue? I have tried searching
 around
 for answers but the only thing I've found that may be related is this:

 https://issues.apache.org/jira/browse/SPARK-3257

 Any/all help is much appreciated.
 Thanks
 Alex



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/External-JARs-not-loading-Spark-Shell-Scala-2-11-tp22434.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: External JARs not loading Spark Shell Scala 2.11

2015-04-09 Thread Prashant Sharma
You are right, this needs to be done. I can work on it soon; I was not sure
if there is anyone even using the Scala 2.11 Spark REPL. Actually, there is a
patch in the Scala 2.10 shell to support adding jars (I lost the JIRA ID),
which has to be ported to Scala 2.11 too. If, however, you (or anyone else)
are planning to work on it, I can help you.

Prashant Sharma



On Thu, Apr 9, 2015 at 3:08 PM, anakos ana...@gmail.com wrote:

 Hi-

 I am having difficulty getting the 1.3.0 Spark shell to find an external
 jar.  I have build Spark locally for Scala 2.11 and I am starting the REPL
 as follows:

 bin/spark-shell --master yarn --jars data-api-es-data-export-4.0.0.jar

 I see the following line in the console output:

 15/04/09 09:52:15 INFO spark.SparkContext: Added JAR

 file:/opt/spark/spark-1.3.0_2.11-hadoop2.3/data-api-es-data-export-4.0.0.jar
 at http://192.168.115.31:54421/jars/data-api-es-data-export-4.0.0.jar with
 timestamp 1428569535904

 but when i try to import anything from this jar, it's simply not available.
 When I try to add the jar manually using the command

 :cp /path/to/jar

 the classes in the jar are still unavailable. I understand that 2.11 is not
 officially supported, but has anyone been able to get an external jar
 loaded
 in the 1.3.0 release?  Is this a known issue? I have tried searching around
 for answers but the only thing I've found that may be related is this:

 https://issues.apache.org/jira/browse/SPARK-3257

 Any/all help is much appreciated.
 Thanks
 Alex



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/External-JARs-not-loading-Spark-Shell-Scala-2-11-tp22434.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Running Spark from Scala source files other than main file

2015-03-11 Thread Imran Rashid
Did you forget to specify the main class with --class Main?  Though if that
was it, you should at least see *some* error message, so I'm confused
myself ...
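
For reference, a minimal submit command that names the main class explicitly
would look something like the line below; the class name and master URL are
placeholders taken from the question, not a verified setup, and --verbose
makes spark-submit print the arguments it parsed:

  spark-submit --verbose --class Main --master spark://<master-host>:7077 Main.jar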

On Wed, Mar 11, 2015 at 6:53 AM, Aung Kyaw Htet akh...@gmail.com wrote:

 Hi Everyone,

 I am developing a Scala app in which the main object does not create the
 SparkContext; another object defined in the same package creates it,
 runs the Spark operations and closes it. The jar file builds successfully in
 Maven, but when I call spark-submit with this jar, Spark does not
 seem to execute any code.

 So my code looks like

 [Main.scala]

 object Main {
   def main(args: Array[String]): Unit = {
     /* check parameters */
     Component1.start(parameters)
   }
 }

 [Component1.scala]

 object Component1 {
   def start(parameters: Array[String]): Unit = {
     val sc = new SparkContext(conf)
     /* do spark operations */
     sc.stop()
   }
 }

 The above code compiles into Main.jar, but spark-submit does not execute
 anything and does not show me any error (not in the logs either):

 spark-submit --master spark://... Main.jar

 I had all of this code working before when I wrote a single Scala
 file, but now that I am separating it into multiple Scala source files,
 something isn't running right.

 Any advice on this would be greatly appreciated!

 Regards,
 Aung



Re: How to pass parameters to a spark-jobserver Scala class?

2015-02-19 Thread Vasu C
Hi Sasi,

I am not sure about Vaadin, but with some simple googling you can find many
articles on how to pass JSON parameters over HTTP, for example:

http://stackoverflow.com/questions/21404252/post-request-send-json-data-java-httpurlconnection

You can also try Finagle, which is a fully fault-tolerant framework by Twitter.


Regards,
   Vasu C



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-pass-parameters-to-a-spark-jobserver-Scala-class-tp21671p21727.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to pass parameters to a spark-jobserver Scala class?

2015-02-18 Thread Vasu C
Hi Sasi,

Forgot to mention that the job server uses the Typesafe Config library. The
input is JSON; you can find the syntax at the link below:

https://github.com/typesafehub/config
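
As a rough illustration of that syntax, the same library can be exercised
directly from Scala; input.string is just an example key, mirroring the job
server examples:

  import com.typesafe.config.ConfigFactory

  // HOCON input, in the same shape the job server hands to a job
  val config = ConfigFactory.parseString("input.string = a b c a b see")
  val value  = config.getString("input.string")   // "a b c a b see"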



Regards,
   Vasu C



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-pass-parameters-to-a-spark-jobserver-Scala-class-tp21671p21695.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to pass parameters to a spark-jobserver Scala class?

2015-02-18 Thread Sasi
Thank you very much Vasu. Let me add some more points to my question. We are
developing a Java program for connecting spark-jobserver to Vaadin (a Java
framework). Following is the sample code I wrote for connecting both (the
code works fine):

// POST to the job server's /jobs endpoint, then read the jobId from the JSON response
URL url = null;
HttpURLConnection connection = null;
String strQueryUri =
    "http://localhost:8090/jobs?appName=sparking&classPath=sparking.jobserver.GetOrCreateUsers&context=user-context";
url = new URL(strQueryUri);
connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("POST");
connection.setRequestProperty("Accept", "application/json");
InputStream isQueryJSON = connection.getInputStream();
LinkedHashMap<String, Object> queryMap =
    (LinkedHashMap<String, Object>) getJSONKeyValue(isQueryJSON, null, "result");
String strJobId = (String) queryMap.get("jobId");

Can you suggest how to modify the above code to pass parameters (as we do
with *curl -d ...*) when running the job?
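
One possible way to do that from the JVM, sketched here in Scala but directly
transferable to the Java snippet above, is to enable output on the connection
and write the same payload that curl -d would send. The URL and the
input.string key are only assumptions carried over from the earlier examples:

  import java.io.OutputStreamWriter
  import java.net.{HttpURLConnection, URL}

  val strQueryUri =
    "http://localhost:8090/jobs?appName=sparking&classPath=sparking.jobserver.GetOrCreateUsers&context=user-context"
  val connection = new URL(strQueryUri).openConnection().asInstanceOf[HttpURLConnection]
  connection.setRequestMethod("POST")
  connection.setRequestProperty("Accept", "application/json")
  connection.setDoOutput(true)                      // allow a request body, like curl -d
  val writer = new OutputStreamWriter(connection.getOutputStream)
  writer.write("input.string = a b c a b see")      // the job's Typesafe Config input
  writer.flush()
  writer.close()
  val responseJson = connection.getInputStream      // reading the response triggers the request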

Hope I make sense.

Sasi



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-pass-parameters-to-a-spark-jobserver-Scala-class-tp21671p21717.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to pass parameters to a spark-jobserver Scala class?

2015-02-17 Thread Vasu C
Hi Sasi,

To pass parameters to spark-jobserver, use curl -d "input.string = a b c a b
see" and in the job server class use config.getString("input.string").

You can pass multiple parameters (starttime, endtime, etc.) and use
config.getString() to read each value.

The examples are shown here
https://github.com/spark-jobserver/spark-jobserver


Regards,
   Vasu C



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-pass-parameters-to-a-spark-jobserver-Scala-class-tp21671p21692.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark and Scala

2014-09-13 Thread Deep Pradhan
Is it always true that whenever we apply operations on an RDD, we get
another RDD?
Or does it depend on the return type of the operation?

On Sat, Sep 13, 2014 at 9:45 AM, Soumya Simanta soumya.sima...@gmail.com
wrote:


 An RDD is a fault-tolerant distributed structure. It is the primary
 abstraction in Spark.

 I would strongly suggest that you have a look at the following to get a
 basic idea.

 http://www.cs.berkeley.edu/~pwendell/strataconf/api/core/spark/RDD.html
 http://spark.apache.org/docs/latest/quick-start.html#basics

 https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia

 On Sat, Sep 13, 2014 at 12:06 AM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Take for example this:
 I have declared one queue *val queue = Queue.empty[Int]*, which is a
 pure scala line in the program. I actually want the queue to be an RDD but
 there are no direct methods to create RDD which is a queue right? What say
 do you have on this?
 Does there exist something like: *Create and RDD which is a queue *?

 On Sat, Sep 13, 2014 at 8:43 AM, Hari Shreedharan 
 hshreedha...@cloudera.com wrote:

 No, Scala primitives remain primitives. Unless you create an RDD using
 one of the many methods - you would not be able to access any of the RDD
 methods. There is no automatic porting. Spark is an application as far as
 scala is concerned - there is no compilation (except of course, the scala,
 JIT compilation etc).

 On Fri, Sep 12, 2014 at 8:04 PM, Deep Pradhan pradhandeep1...@gmail.com
  wrote:

 I know that unpersist is a method on RDD.
 But my confusion is that, when we port our Scala programs to Spark,
 doesn't everything change to RDDs?

 On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 unpersist is a method on RDDs. RDDs are abstractions introduced by
 Spark.

 An Int is just a Scala Int. You can't call unpersist on Int in Scala,
 and that doesn't change in Spark.

 On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

 There is one thing that I am confused about.
 Spark has codes that have been implemented in Scala. Now, can we run
 any Scala code on the Spark framework? What will be the difference in the
 execution of the scala code in normal systems and on Spark?
 The reason for my question is the following:
 I had a variable
 *val temp = some operations*
 This temp was being created inside the loop, so as to manually throw
 it out of the cache, every time the loop ends I was calling
 *temp.unpersist()*, this was returning an error saying that *value
 unpersist is not a method of Int*, which means that temp is an Int.
 Can some one explain to me why I was not able to call *unpersist* on
 *temp*?

 Thank You









Re: Spark and Scala

2014-09-13 Thread Mark Hamstra
This is all covered in
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations

By definition, RDD transformations take an RDD to another RDD; actions
produce some other type as a value on the driver program.
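
A small, self-contained illustration of that distinction (any spark-shell
session should do; the numbers are arbitrary):

  val nums    = sc.parallelize(1 to 10)   // RDD[Int]
  val doubled = nums.map(_ * 2)           // transformation: still an RDD[Int], nothing computed yet
  val total   = doubled.reduce(_ + _)     // action: returns a plain Int (110) to the driver
  val asArray = doubled.collect()         // action: returns Array[Int] to the driver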

On Fri, Sep 12, 2014 at 11:15 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Is it always true that whenever we apply operations on an RDD, we get
 another RDD?
 Or does it depend on the return type of the operation?

 On Sat, Sep 13, 2014 at 9:45 AM, Soumya Simanta soumya.sima...@gmail.com
 wrote:


 An RDD is a fault-tolerant distributed structure. It is the primary
 abstraction in Spark.

 I would strongly suggest that you have a look at the following to get a
 basic idea.

 http://www.cs.berkeley.edu/~pwendell/strataconf/api/core/spark/RDD.html
 http://spark.apache.org/docs/latest/quick-start.html#basics

 https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia

 On Sat, Sep 13, 2014 at 12:06 AM, Deep Pradhan pradhandeep1...@gmail.com
  wrote:

 Take for example this:
 I have declared one queue *val queue = Queue.empty[Int]*, which is a
 pure scala line in the program. I actually want the queue to be an RDD but
 there are no direct methods to create RDD which is a queue right? What say
 do you have on this?
 Does there exist something like: *Create and RDD which is a queue *?

 On Sat, Sep 13, 2014 at 8:43 AM, Hari Shreedharan 
 hshreedha...@cloudera.com wrote:

 No, Scala primitives remain primitives. Unless you create an RDD using
 one of the many methods - you would not be able to access any of the RDD
 methods. There is no automatic porting. Spark is an application as far as
 scala is concerned - there is no compilation (except of course, the scala,
 JIT compilation etc).

 On Fri, Sep 12, 2014 at 8:04 PM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

 I know that unpersist is a method on RDD.
 But my confusion is that, when we port our Scala programs to Spark,
 doesn't everything change to RDDs?

 On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 unpersist is a method on RDDs. RDDs are abstractions introduced by
 Spark.

 An Int is just a Scala Int. You can't call unpersist on Int in Scala,
 and that doesn't change in Spark.

 On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

 There is one thing that I am confused about.
 Spark has codes that have been implemented in Scala. Now, can we run
 any Scala code on the Spark framework? What will be the difference in 
 the
 execution of the scala code in normal systems and on Spark?
 The reason for my question is the following:
 I had a variable
 *val temp = some operations*
 This temp was being created inside the loop, so as to manually throw
 it out of the cache, every time the loop ends I was calling
 *temp.unpersist()*, this was returning an error saying that *value
 unpersist is not a method of Int*, which means that temp is an Int.
 Can some one explain to me why I was not able to call *unpersist*
 on *temp*?

 Thank You










Re: Spark and Scala

2014-09-13 Thread Deep Pradhan
Take for example this:


val lines = sc.textFile(args(0))
val nodes = lines.map(s => {
  val fields = s.split("\\s+")
  (fields(0), fields(1))
}).distinct().groupByKey().cache()

val nodeSizeTuple = nodes.map(node => (node._1.toInt, node._2.size))
val rootNode = nodeSizeTuple.top(1)(Ordering.by(f => f._2))

The nodeSizeTuple is an RDD, but rootNode is an array. Here I have used all
RDD operations, but I am getting an array.
What about this case?

On Sat, Sep 13, 2014 at 11:45 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Is it always true that whenever we apply operations on an RDD, we get
 another RDD?
 Or does it depend on the return type of the operation?

 On Sat, Sep 13, 2014 at 9:45 AM, Soumya Simanta soumya.sima...@gmail.com
 wrote:


 An RDD is a fault-tolerant distributed structure. It is the primary
 abstraction in Spark.

 I would strongly suggest that you have a look at the following to get a
 basic idea.

 http://www.cs.berkeley.edu/~pwendell/strataconf/api/core/spark/RDD.html
 http://spark.apache.org/docs/latest/quick-start.html#basics

 https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia

 On Sat, Sep 13, 2014 at 12:06 AM, Deep Pradhan pradhandeep1...@gmail.com
  wrote:

 Take for example this:
 I have declared one queue *val queue = Queue.empty[Int]*, which is a
 pure scala line in the program. I actually want the queue to be an RDD but
 there are no direct methods to create RDD which is a queue right? What say
 do you have on this?
 Does there exist something like: *Create and RDD which is a queue *?

 On Sat, Sep 13, 2014 at 8:43 AM, Hari Shreedharan 
 hshreedha...@cloudera.com wrote:

 No, Scala primitives remain primitives. Unless you create an RDD using
 one of the many methods - you would not be able to access any of the RDD
 methods. There is no automatic porting. Spark is an application as far as
 scala is concerned - there is no compilation (except of course, the scala,
 JIT compilation etc).

 On Fri, Sep 12, 2014 at 8:04 PM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

 I know that unpersist is a method on RDD.
 But my confusion is that, when we port our Scala programs to Spark,
 doesn't everything change to RDDs?

 On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 unpersist is a method on RDDs. RDDs are abstractions introduced by
 Spark.

 An Int is just a Scala Int. You can't call unpersist on Int in Scala,
 and that doesn't change in Spark.

 On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

 There is one thing that I am confused about.
 Spark has codes that have been implemented in Scala. Now, can we run
 any Scala code on the Spark framework? What will be the difference in 
 the
 execution of the scala code in normal systems and on Spark?
 The reason for my question is the following:
 I had a variable
 *val temp = some operations*
 This temp was being created inside the loop, so as to manually throw
 it out of the cache, every time the loop ends I was calling
 *temp.unpersist()*, this was returning an error saying that *value
 unpersist is not a method of Int*, which means that temp is an Int.
 Can some one explain to me why I was not able to call *unpersist*
 on *temp*?

 Thank You










Re: Spark and Scala

2014-09-13 Thread Mark Hamstra
Again, RDD operations are of two basic varieties: transformations, which
produce further RDDs; and actions, which return values to the driver
program.  You've used several RDD transformations and then finally the
top(1) action, which returns an array of one element to your driver
program.  That is exactly what you should expect from the description of
RDD#top in the API.
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
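
Concretely, with a pair RDD of the same shape, top is an action and therefore
hands back a local Array rather than another RDD (the sample data below is
made up):

  val nodeSizes = sc.parallelize(Seq((1, 4), (2, 9), (3, 2)))  // RDD[(Int, Int)]
  val rootNode  = nodeSizes.top(1)(Ordering.by(f => f._2))     // action: Array[(Int, Int)] = Array((2,9))
  val bigNodes  = nodeSizes.filter(_._2 > 3)                   // transformation: an RDD again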

On Sat, Sep 13, 2014 at 12:34 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Take for example this:


 *val lines = sc.textFile(args(0))*
 *val nodes = lines.map(s ={  *
 *val fields = s.split(\\s+)*
 *(fields(0),fields(1))*
 *}).distinct().groupByKey().cache() *

 *val nodeSizeTuple = nodes.map(node = (node._1.toInt, node._2.size))*
 *val rootNode = nodeSizeTuple.top(1)(Ordering.by(f = f._2))*

 The nodeSizeTuple is an RDD,but rootNode is an array. Here I have used all
 RDD operations, but I am getting an array.
 What about this case?

 On Sat, Sep 13, 2014 at 11:45 AM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Is it always true that whenever we apply operations on an RDD, we get
 another RDD?
 Or does it depend on the return type of the operation?

 On Sat, Sep 13, 2014 at 9:45 AM, Soumya Simanta soumya.sima...@gmail.com
  wrote:


 An RDD is a fault-tolerant distributed structure. It is the primary
 abstraction in Spark.

 I would strongly suggest that you have a look at the following to get a
 basic idea.

 http://www.cs.berkeley.edu/~pwendell/strataconf/api/core/spark/RDD.html
 http://spark.apache.org/docs/latest/quick-start.html#basics

 https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia

 On Sat, Sep 13, 2014 at 12:06 AM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

 Take for example this:
 I have declared one queue *val queue = Queue.empty[Int]*, which is a
 pure scala line in the program. I actually want the queue to be an RDD but
 there are no direct methods to create RDD which is a queue right? What say
 do you have on this?
 Does there exist something like: *Create and RDD which is a queue *?

 On Sat, Sep 13, 2014 at 8:43 AM, Hari Shreedharan 
 hshreedha...@cloudera.com wrote:

 No, Scala primitives remain primitives. Unless you create an RDD using
 one of the many methods - you would not be able to access any of the RDD
 methods. There is no automatic porting. Spark is an application as far as
 scala is concerned - there is no compilation (except of course, the scala,
 JIT compilation etc).

 On Fri, Sep 12, 2014 at 8:04 PM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

 I know that unpersist is a method on RDD.
 But my confusion is that, when we port our Scala programs to Spark,
 doesn't everything change to RDDs?

 On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 unpersist is a method on RDDs. RDDs are abstractions introduced by
 Spark.

 An Int is just a Scala Int. You can't call unpersist on Int in
 Scala, and that doesn't change in Spark.

 On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

 There is one thing that I am confused about.
 Spark has codes that have been implemented in Scala. Now, can we
 run any Scala code on the Spark framework? What will be the difference 
 in
 the execution of the scala code in normal systems and on Spark?
 The reason for my question is the following:
 I had a variable
 *val temp = some operations*
 This temp was being created inside the loop, so as to manually
 throw it out of the cache, every time the loop ends I was calling
 *temp.unpersist()*, this was returning an error saying that *value
 unpersist is not a method of Int*, which means that temp is an
 Int.
 Can some one explain to me why I was not able to call *unpersist*
 on *temp*?

 Thank You











Spark and Scala

2014-09-12 Thread Deep Pradhan
There is one thing that I am confused about.
Spark itself is implemented in Scala. Now, can we run any Scala code on the
Spark framework? What will be the difference in the execution of the Scala
code on a normal system and on Spark?
The reason for my question is the following: I had a variable
*val temp = some operations*
This temp was being created inside a loop. To manually throw it out of the
cache every time the loop ends, I was calling *temp.unpersist()*, but this
returned an error saying that *value unpersist is not a method of Int*,
which means that temp is an Int.
Can someone explain to me why I was not able to call *unpersist* on *temp*?

Thank You


Re: Spark and Scala

2014-09-12 Thread Nicholas Chammas
unpersist is a method on RDDs. RDDs are abstractions introduced by Spark.

An Int is just a Scala Int. You can't call unpersist on Int in Scala, and
that doesn't change in Spark.
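
A compact way to see the difference in a spark-shell session (a minimal
sketch, not tied to the original program):

  val temp  = sc.parallelize(1 to 100).map(_ * 2)  // RDD[Int]: cache/unpersist are available
  temp.cache()
  val total = temp.count()                         // action: total is a plain Long, not an RDD
  temp.unpersist()                                 // fine: unpersist is defined on RDDs
  // total.unpersist()                             // would not compile: Long has no such method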

On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 There is one thing that I am confused about.
 Spark has codes that have been implemented in Scala. Now, can we run any
 Scala code on the Spark framework? What will be the difference in the
 execution of the scala code in normal systems and on Spark?
 The reason for my question is the following:
 I had a variable
 *val temp = some operations*
 This temp was being created inside the loop, so as to manually throw it
 out of the cache, every time the loop ends I was calling
 *temp.unpersist()*, this was returning an error saying that *value
 unpersist is not a method of Int*, which means that temp is an Int.
 Can some one explain to me why I was not able to call *unpersist* on
 *temp*?

 Thank You



Re: Spark and Scala

2014-09-12 Thread Deep Pradhan
I know that unpersist is a method on RDD.
But my confusion is that, when we port our Scala programs to Spark, doesn't
everything change to RDDs?

On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 unpersist is a method on RDDs. RDDs are abstractions introduced by Spark.

 An Int is just a Scala Int. You can't call unpersist on Int in Scala, and
 that doesn't change in Spark.

 On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 There is one thing that I am confused about.
 Spark has codes that have been implemented in Scala. Now, can we run any
 Scala code on the Spark framework? What will be the difference in the
 execution of the scala code in normal systems and on Spark?
 The reason for my question is the following:
 I had a variable
 *val temp = some operations*
 This temp was being created inside the loop, so as to manually throw it
 out of the cache, every time the loop ends I was calling
 *temp.unpersist()*, this was returning an error saying that *value
 unpersist is not a method of Int*, which means that temp is an Int.
 Can some one explain to me why I was not able to call *unpersist* on
 *temp*?

 Thank You





Re: Spark and Scala

2014-09-12 Thread Hari Shreedharan
No, Scala primitives remain primitives. Unless you create an RDD using one
of the many methods for doing so, you will not be able to access any of the
RDD methods. There is no automatic porting. Spark is just an application as
far as Scala is concerned - there is no special compilation (except, of
course, the usual Scala and JIT compilation).

On Fri, Sep 12, 2014 at 8:04 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 I know that unpersist is a method on RDD.
 But my confusion is that, when we port our Scala programs to Spark,
 doesn't everything change to RDDs?

 On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 unpersist is a method on RDDs. RDDs are abstractions introduced by Spark.

 An Int is just a Scala Int. You can't call unpersist on Int in Scala, and
 that doesn't change in Spark.

 On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan pradhandeep1...@gmail.com
  wrote:

 There is one thing that I am confused about.
 Spark has codes that have been implemented in Scala. Now, can we run any
 Scala code on the Spark framework? What will be the difference in the
 execution of the scala code in normal systems and on Spark?
 The reason for my question is the following:
 I had a variable
 *val temp = some operations*
 This temp was being created inside the loop, so as to manually throw it
 out of the cache, every time the loop ends I was calling
 *temp.unpersist()*, this was returning an error saying that *value
 unpersist is not a method of Int*, which means that temp is an Int.
 Can some one explain to me why I was not able to call *unpersist* on
 *temp*?

 Thank You






Re: Spark and Scala

2014-09-12 Thread Deep Pradhan
Take for example this:
I have declared a queue, *val queue = Queue.empty[Int]*, which is a pure
Scala line in the program. I actually want the queue to be an RDD, but there
is no direct method to create an RDD which is a queue, right? What are your
thoughts on this?
Does there exist something like *create an RDD which is a queue*?

On Sat, Sep 13, 2014 at 8:43 AM, Hari Shreedharan hshreedha...@cloudera.com
 wrote:

 No, Scala primitives remain primitives. Unless you create an RDD using one
 of the many methods - you would not be able to access any of the RDD
 methods. There is no automatic porting. Spark is an application as far as
 scala is concerned - there is no compilation (except of course, the scala,
 JIT compilation etc).

 On Fri, Sep 12, 2014 at 8:04 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 I know that unpersist is a method on RDD.
 But my confusion is that, when we port our Scala programs to Spark,
 doesn't everything change to RDDs?

 On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 unpersist is a method on RDDs. RDDs are abstractions introduced by Spark.

 An Int is just a Scala Int. You can't call unpersist on Int in Scala,
 and that doesn't change in Spark.

 On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

 There is one thing that I am confused about.
 Spark has codes that have been implemented in Scala. Now, can we run
 any Scala code on the Spark framework? What will be the difference in the
 execution of the scala code in normal systems and on Spark?
 The reason for my question is the following:
 I had a variable
 *val temp = some operations*
 This temp was being created inside the loop, so as to manually throw it
 out of the cache, every time the loop ends I was calling
 *temp.unpersist()*, this was returning an error saying that *value
 unpersist is not a method of Int*, which means that temp is an Int.
 Can some one explain to me why I was not able to call *unpersist* on
 *temp*?

 Thank You







Re: Spark and Scala

2014-09-12 Thread Soumya Simanta
An RDD is a fault-tolerant distributed structure. It is the primary
abstraction in Spark.

I would strongly suggest that you have a look at the following to get a
basic idea.

http://www.cs.berkeley.edu/~pwendell/strataconf/api/core/spark/RDD.html
http://spark.apache.org/docs/latest/quick-start.html#basics
https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia
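
There is no RDD that behaves like a mutable queue, but an RDD can be built
from the queue's current contents with sc.parallelize (a minimal sketch; the
element values are arbitrary):

  import scala.collection.immutable.Queue

  val queue    = Queue(1, 2, 3, 4)
  val queueRdd = sc.parallelize(queue)   // RDD[Int] holding the queue's elements
  // the RDD is an immutable snapshot; enqueuing into another queue later
  // does not change queueRdd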

On Sat, Sep 13, 2014 at 12:06 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Take for example this:
 I have declared one queue *val queue = Queue.empty[Int]*, which is a pure
 scala line in the program. I actually want the queue to be an RDD but there
 are no direct methods to create RDD which is a queue right? What say do you
 have on this?
 Does there exist something like: *Create and RDD which is a queue *?

 On Sat, Sep 13, 2014 at 8:43 AM, Hari Shreedharan 
 hshreedha...@cloudera.com wrote:

 No, Scala primitives remain primitives. Unless you create an RDD using
 one of the many methods - you would not be able to access any of the RDD
 methods. There is no automatic porting. Spark is an application as far as
 scala is concerned - there is no compilation (except of course, the scala,
 JIT compilation etc).

 On Fri, Sep 12, 2014 at 8:04 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 I know that unpersist is a method on RDD.
 But my confusion is that, when we port our Scala programs to Spark,
 doesn't everything change to RDDs?

 On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 unpersist is a method on RDDs. RDDs are abstractions introduced by
 Spark.

 An Int is just a Scala Int. You can't call unpersist on Int in Scala,
 and that doesn't change in Spark.

 On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

 There is one thing that I am confused about.
 Spark has codes that have been implemented in Scala. Now, can we run
 any Scala code on the Spark framework? What will be the difference in the
 execution of the scala code in normal systems and on Spark?
 The reason for my question is the following:
 I had a variable
 *val temp = some operations*
 This temp was being created inside the loop, so as to manually throw
 it out of the cache, every time the loop ends I was calling
 *temp.unpersist()*, this was returning an error saying that *value
 unpersist is not a method of Int*, which means that temp is an Int.
 Can some one explain to me why I was not able to call *unpersist* on
 *temp*?

 Thank You








Re: Running spark examples/scala scripts

2014-03-19 Thread Pariksheet Barapatre
:-) Thanks for suggestion.

I was actually asking how to run Spark Scala scripts as a standalone app. I am
able to run Java code and Python code as standalone apps.

One more doubt: the documentation says that to read an HDFS file, we need to
add the dependency

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>1.0.1</version>
</dependency>

How do I know the HDFS version? I just guessed 1.0.1 and it worked.

The next task is to run the Scala code with sbt.
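
A minimal build.sbt for such a standalone app might look like the sketch
below. The versions are only examples for this era (the exact Hadoop version
can be confirmed with the hadoop version command on the cluster), so match
them to your own installation; sbt package then builds the application jar:

  // build.sbt -- minimal sketch for a standalone Spark app
  name := "spark-standalone-example"

  scalaVersion := "2.10.3"

  libraryDependencies ++= Seq(
    "org.apache.spark"  %% "spark-core"    % "0.9.0-incubating",
    "org.apache.hadoop" %  "hadoop-client" % "1.0.1"
  )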

Cheers
Pari


On 18 March 2014 22:33, Mayur Rustagi mayur.rust...@gmail.com wrote:

 print out the last line and run it outside on the shell :)

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Tue, Mar 18, 2014 at 2:37 AM, Pariksheet Barapatre 
 pbarapa...@gmail.com wrote:

 Hello all,

 I am trying to run shipped in example with spark i.e. in example
 directory.

 [cloudera@aster2 examples]$ ls
 bagel                       ExceptionHandlingTest.scala  HdfsTest2.scala    LocalKMeans.scala  MultiBroadcastTest.scala       SparkHdfsLR.scala  SparkPi.scala
 BroadcastTest.scala         graphx                       HdfsTest.scala     LocalLR.scala      SimpleSkewedGroupByTest.scala  SparkKMeans.scala  SparkTC.scala
 CassandraTest.scala         GroupByTest.scala            LocalALS.scala     LocalPi.scala      SkewedGroupByTest.scala        SparkLR.scala
 DriverSubmissionTest.scala  HBaseTest.scala              LocalFileLR.scala  LogQuery.scala     SparkALS.scala                 SparkPageRank.scala


 I am able to run these examples using the run-example script, but how do I
 run these examples without using the run-example script?




 Thanks,
 Pari





-- 
Cheers,
Pari