Re: Spark 2.0 with Hadoop 3.0?

2016-10-28 Thread Sean Owen
I don't think it works, but there is no Hadoop 3.0 release right now either. As
the major version bump implies, it's going to be somewhat different API-wise.

On Thu, Oct 27, 2016 at 11:04 PM adam kramer  wrote:

> Is the version of Spark built for Hadoop 2.7 and later only for 2.x
> releases?
>
> Is there any reason why Hadoop 3.0 is a non-starter for use with Spark
> 2.0? The version of aws-sdk in 3.0 actually works for DynamoDB which
> would resolve our driver dependency issues.
>
> Thanks,
> Adam
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Executor shutdown hook and initialization

2016-10-27 Thread Sean Owen
Init is easy -- initialize them in your singleton.
Shutdown is harder; a shutdown hook is probably the only reliable way to go.
Global state is not ideal in Spark. Consider initializing things like
connections per partition, and open/close them with the lifecycle of a
computation on a partition instead.
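
For illustration, a minimal sketch of that per-partition lifecycle (the
Connection class below is just a stand-in for a real client, and sc is assumed
to be an existing SparkContext):

class Connection {                        // stand-in for a real client (JDBC, HTTP, ...)
  def send(s: String): Unit = println(s)
  def close(): Unit = ()
}

val rdd = sc.parallelize(1 to 100)
rdd.foreachPartition { records =>
  val conn = new Connection()             // opened once per partition, on the executor
  try {
    records.foreach(r => conn.send(r.toString))
  } finally {
    conn.close()                          // always closed when the partition finishes
  }
}

If you do keep a singleton instead, sys.addShutdownHook { /* close connections */ }
in the object's initializer is one way to get the JVM-exit cleanup that a
shutdown hook provides.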

On Wed, Oct 26, 2016 at 9:27 PM Walter rakoff 
wrote:

> Hello,
>
> Is there a way I can add an init() call when an executor is created? I'd
> like to initialize a few connections that are part of my singleton object.
> Preferably this happens before it runs the first task.
> Along the same lines, how can I provide a shutdown hook that cleans up these
> connections on termination?
>
> Thanks
> Walt
>


Re: HiveContext is Serialized?

2016-10-26 Thread Sean Owen
Yes, but the question here is why the context objects are marked
serializable when they are not meant to be sent somewhere as bytes. I tried
to answer that apparent inconsistency below.

On Wed, Oct 26, 2016, 10:21 Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi,
>
> Sorry for asking this rather naïve question.
>
> Regarding the notion of serialisation in Spark and what can or cannot be
> serialised: does this generally refer to the concept of serialisation in the
> context of data storage?
>
> In this context, for example with reference to RDD operations, is it the
> process of translating object state into a format that can be stored in and
> retrieved from a memory buffer?
>
> Thanks
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 26 October 2016 at 09:06, Sean Owen <so...@cloudera.com> wrote:
>
> It is the driver that has the info needed to schedule and manage
> distributed jobs and that is by design.
>
> This is narrowly about using the HiveContext or SparkContext directly. Of
> course SQL operations are distributed.
>
>
> On Wed, Oct 26, 2016, 10:03 Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> Hi Sean,
>
> Your point:
>
> "You can't use the HiveContext or SparkContext in a distribution
> operation..."
>
> Is this because of design issue?
>
> Case in point if I created a DF from RDD and register it as a tempTable,
> does this imply that any sql calls on that table islocalised and not
> distributed among executors?
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 26 October 2016 at 06:43, Ajay Chander <itsche...@gmail.com> wrote:
>
> Sean, thank you for making it clear. It was helpful.
>
> Regards,
> Ajay
>
>
> On Wednesday, October 26, 2016, Sean Owen <so...@cloudera.com> wrote:
>
> This usage is fine, because you are only using the HiveContext locally on
> the driver. It's applied in a function that's used on a Scala collection.
>
> You can't use the HiveContext or SparkContext in a distributed operation.
> It has nothing to do with for loops.
>
> The fact that they're serializable is misleading. It's there, I believe,
> because these objects may be inadvertently referenced in the closure of a
> function that executes remotely, yet doesn't use the context. The closure
> cleaner can't always remove this reference. The task would fail to
> serialize even though it doesn't use the context. You will find these
> objects serialize but then don't work if used remotely.
>
> The NPE you see is an unrelated cosmetic problem that was fixed in 2.0.1
> IIRC.
>
> On Wed, Oct 26, 2016 at 4:28 AM Ajay Chander <itsche...@gmail.com> wrote:
>
> Hi Everyone,
>
> I was thinking if I can use hiveContext inside foreach like below,
>
> object Test {
>   def main(args: Array[String]): Unit = {
>
> val conf = new SparkConf()
> val sc = new SparkContext(conf)
> val hiveContext = new HiveContext(sc)
>
> val dataElementsFile = args(0)
> val deDF = 
> hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()
>
> def calculate(de: Row) {
>   val dataElement = de.getAs[String]("DataElement").trim
>   val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" + 
> dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM 
> TEST_DB.TEST_TABLE1 ")
>   df1.write.insertInto("TEST_DB.TEST_TABLE1")
> }
>
> deDF.collect().foreach(calculate)
>   }
> }
>
>
> I looked at 
> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>  and I see it is extending SqlContext which extends Logging with Serializable.
>
> Can anyone tell me if this is the right way to use it ? Thanks for your time.
>
> Regards,
>
> Ajay
>
>
>
>


Re: HiveContext is Serialized?

2016-10-26 Thread Sean Owen
It is the driver that has the info needed to schedule and manage
distributed jobs and that is by design.

This is narrowly about using the HiveContext or SparkContext directly. Of
course SQL operations are distributed.

On Wed, Oct 26, 2016, 10:03 Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi Sean,
>
> Your point:
>
> "You can't use the HiveContext or SparkContext in a distribution
> operation..."
>
> Is this because of design issue?
>
> Case in point if I created a DF from RDD and register it as a tempTable,
> does this imply that any sql calls on that table islocalised and not
> distributed among executors?
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 26 October 2016 at 06:43, Ajay Chander <itsche...@gmail.com> wrote:
>
> Sean, thank you for making it clear. It was helpful.
>
> Regards,
> Ajay
>
>
> On Wednesday, October 26, 2016, Sean Owen <so...@cloudera.com> wrote:
>
> This usage is fine, because you are only using the HiveContext locally on
> the driver. It's applied in a function that's used on a Scala collection.
>
> You can't use the HiveContext or SparkContext in a distributed operation.
> It has nothing to do with for loops.
>
> The fact that they're serializable is misleading. It's there, I believe,
> because these objects may be inadvertently referenced in the closure of a
> function that executes remotely, yet doesn't use the context. The closure
> cleaner can't always remove this reference. The task would fail to
> serialize even though it doesn't use the context. You will find these
> objects serialize but then don't work if used remotely.
>
> The NPE you see is an unrelated cosmetic problem that was fixed in 2.0.1
> IIRC.
>
> On Wed, Oct 26, 2016 at 4:28 AM Ajay Chander <itsche...@gmail.com> wrote:
>
> Hi Everyone,
>
> I was thinking if I can use hiveContext inside foreach like below,
>
> object Test {
>   def main(args: Array[String]): Unit = {
>
> val conf = new SparkConf()
> val sc = new SparkContext(conf)
> val hiveContext = new HiveContext(sc)
>
> val dataElementsFile = args(0)
> val deDF = 
> hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()
>
> def calculate(de: Row) {
>   val dataElement = de.getAs[String]("DataElement").trim
>   val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" + 
> dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM 
> TEST_DB.TEST_TABLE1 ")
>   df1.write.insertInto("TEST_DB.TEST_TABLE1")
> }
>
> deDF.collect().foreach(calculate)
>   }
> }
>
>
> I looked at 
> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>  and I see it is extending SqlContext which extends Logging with Serializable.
>
> Can anyone tell me if this is the right way to use it ? Thanks for your time.
>
> Regards,
>
> Ajay
>
>
>


Re: HiveContext is Serialized?

2016-10-25 Thread Sean Owen
This usage is fine, because you are only using the HiveContext locally on
the driver. It's applied in a function that's used on a Scala collection.

You can't use the HiveContext or SparkContext in a distributed operation.
It has nothing to do with for loops.

The fact that they're serializable is misleading. It's there, I believe,
because these objects may be inadvertently referenced in the closure of a
function that executes remotely, yet doesn't use the context. The closure
cleaner can't always remove this reference. The task would fail to
serialize even though it doesn't use the context. You will find these
objects serialize but then don't work if used remotely.

The NPE you see is an unrelated cosmetic problem that was fixed in 2.0.1
IIRC.
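
To make the distinction concrete, a rough sketch in terms of the hiveContext
and deDF defined in the quoted code below:

// Fine: collect() brings the rows to the driver, and calculate runs there,
// so hiveContext is only ever used locally.
deDF.collect().foreach(calculate)

// Not fine: this closure runs on the executors, where the context cannot be
// used, even though HiveContext happens to extend Serializable.
deDF.rdd.foreach { row =>
  hiveContext.sql("SELECT 1")   // serializes, but fails at runtime on the executor
}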

On Wed, Oct 26, 2016 at 4:28 AM Ajay Chander  wrote:

> Hi Everyone,
>
> I was thinking if I can use hiveContext inside foreach like below,
>
> object Test {
>   def main(args: Array[String]): Unit = {
>
> val conf = new SparkConf()
> val sc = new SparkContext(conf)
> val hiveContext = new HiveContext(sc)
>
> val dataElementsFile = args(0)
> val deDF = 
> hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()
>
> def calculate(de: Row) {
>   val dataElement = de.getAs[String]("DataElement").trim
>   val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" + 
> dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM 
> TEST_DB.TEST_TABLE1 ")
>   df1.write.insertInto("TEST_DB.TEST_TABLE1")
> }
>
> deDF.collect().foreach(calculate)
>   }
> }
>
>
> I looked at 
> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>  and I see it is extending SqlContext which extends Logging with Serializable.
>
> Can anyone tell me if this is the right way to use it ? Thanks for your time.
>
> Regards,
>
> Ajay
>
>


Re: Spark 1.2

2016-10-25 Thread Sean Owen
archive.apache.org will always have all the releases:
http://archive.apache.org/dist/spark/

On Tue, Oct 25, 2016 at 1:17 PM ayan guha  wrote:

> Just in case, anyone knows how I can download Spark 1.2? It is not showing
> up in Spark download page drop down
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Generate random numbers from Normal Distribution with Specific Mean and Variance

2016-10-24 Thread Sean Owen
In the context of Spark, there are already things like RandomRDDs and SQL
randn() to generate standard normal random variables.

If you want to do it directly, Commons Math is a good choice in the JVM,
among others.

Once you have a standard normal, just multiply by the stdev and add the
mean to get any other univariate normal. No need for special support for it.
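
For instance, a minimal sketch with the SQL function (this assumes an existing
SparkSession named spark, and a target mean of 10.0 with standard deviation 2.5):

import org.apache.spark.sql.functions.randn

val mean = 10.0
val stdev = 2.5

// randn() draws from N(0, 1); scale and shift to get N(mean, stdev^2).
val samples = spark.range(1000000).withColumn("x", randn(42) * stdev + mean)
samples.selectExpr("avg(x)", "stddev(x)").show()

// RDD equivalent using mllib:
// RandomRDDs.normalRDD(spark.sparkContext, 1000000L).map(_ * stdev + mean)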

On Mon, Oct 24, 2016 at 5:05 PM Mich Talebzadeh 
wrote:

> me being lazy
>
> Does anyone have a library to create an array of random numbers from a
> normal distribution with a given mean and variance, by any chance?
>
> Something like here
> 
>
>
> Thanks
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread Sean Owen
I believe it will be too late to set it there, and these are JVM flags, not
app or Spark flags. See spark.driver.extraJavaOptions and likewise for the
executor.

On Mon, Oct 24, 2016 at 4:04 PM Pietro Pugni  wrote:

> Thank you!
>
> I tried again setting locale options in different ways, but they don’t
> propagate to the JVM. I tested these strategies (alone and all together):
> - bin/spark-submit --conf
> "spark.executor.extraJavaOptions=-Duser.language=en -Duser.region=US
> -Duser.country=US -Duser.timezone=GMT” test.py
> - spark = SparkSession \
> .builder \
> .appName("My app") \
> .config("spark.executor.extraJavaOptions", "-Duser.language=en
> -Duser.region=US -Duser.country=US -Duser.timezone=GMT") \
> .config("user.country", "US") \
> .config("user.region", "US") \
> .config("user.language", "en") \
> .config("user.timezone", "GMT") \
> .config("-Duser.country", "US") \
> .config("-Duser.region", "US") \
> .config("-Duser.language", "en") \
> .config("-Duser.timezone", "GMT") \
> .getOrCreate()
> - export JAVA_OPTS="-Duser.language=en -Duser.region=US -Duser.country=US
> -Duser.timezone=GMT”
> - export LANG="en_US.UTF-8”
>
> After running export LANG="en_US.UTF-8” from the same terminal session I
> use to launch spark-submit, if I run locale command I get correct values:
> LANG="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_CTYPE="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_ALL=
>
> While running my pyspark script, from the Spark UI,  under Environment ->
> Spark Properties the locale appear to be correctly set:
> - user.country: US
> - user.language: en
> - user.region: US
> - user.timezone: GMT
>
> but Environment -> System Properties still reports the System locale and
> not the session locale I previously set:
> - user.country: IT
> - user.language: it
> - user.timezone: Europe/Rome
>
> Am I wrong, or do the options not propagate to the JVM correctly?
>
>
>


Re: reading info from spark 2.0 application UI

2016-10-24 Thread Sean Owen
What matters in this case is how many vcores YARN thinks it can allocate
per machine. I think the relevant setting is
yarn.nodemanager.resource.cpu-vcores. I bet you'll find this is actually
more than the machine's number of cores, possibly on purpose, to enable
some over-committing.

On Mon, Oct 24, 2016 at 4:13 PM TheGeorge1918 . <zhangxuan1...@gmail.com>
wrote:

> Yep. I'm pretty sure that 4 executors are on 1 machine. I use yarn in emr.
> I have another "faulty" configuration with 9 executors and 5 of them are on
> one machine. Each one has 9 cores, which adds up to 45 cores on that
> machine (the total vcores is 40). Still it works. The total number of
> vcores is 80 in the cluster but I get 81 in total from executors, excluding
> the cores for the driver and system. I use AWS EMR EC2 instances, where the
> resources available for each type of machine are specified. Maybe I could go
> beyond the limitation in the cluster. I just want to make sure I understand
> correctly that when allocating vcores, it means vcores, not threads.
>
> Thanks a lot.
>
> Best
>
>
>
> On Mon, Oct 24, 2016 at 4:55 PM, Sean Owen <so...@cloudera.com> wrote:
>
> If you're really sure that 4 executors are on 1 machine, then it means
> your resource manager allowed it. What are you using, YARN? check that you
> really are limited to 40 cores per machine in the YARN config.
>
> On Mon, Oct 24, 2016 at 3:33 PM TheGeorge1918 . <zhangxuan1...@gmail.com>
> wrote:
>
> Hi all,
>
> I'm deeply confused by the executor configuration in Spark. I have two
> machines, each with 40 vcores. By mistake, I assign 7 executors and each
> with 11 vcores (It ran without any problem). As a result, one machine has 4
> executors and the other has 3 executors + driver. But this means for the
> machine with 4 executors, it needs 4 x 11 = 44 vcores which is more than 40
> vcores available on that machine. Do I miss something here? Thanks a lot.
>
> aws emr cluster:
> 2 x m4.10xlarge machine, each with 40 vcores, 160G memory
>
> spark:
> num executors: 7
> executor memory: 33G
> num cores: 11
> driver memory: 39G
> driver cores: 6
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>


Re: reading info from spark 2.0 application UI

2016-10-24 Thread Sean Owen
If you're really sure that 4 executors are on 1 machine, then it means your
resource manager allowed it. What are you using, YARN? check that you
really are limited to 40 cores per machine in the YARN config.

On Mon, Oct 24, 2016 at 3:33 PM TheGeorge1918 . 
wrote:

> Hi all,
>
> I'm deeply confused by the executor configuration in Spark. I have two
> machines, each with 40 vcores. By mistake, I assign 7 executors and each
> with 11 vcores (It ran without any problem). As a result, one machine has 4
> executors and the other has 3 executors + driver. But this means for the
> machine with 4 executors, it needs 4 x 11 = 44 vcores which is more than 40
> vcores available on that machine. Do I miss something here? Thanks a lot.
>
> aws emr cluster:
> 2 x m4.10xlarge machine, each with 40 vcores, 160G memory
>
> spark:
> num executors: 7
> executor memory: 33G
> num cores: 11
> driver memory: 39G
> driver cores: 6
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread Sean Owen
This is more of an OS-level thing, but I think that if you can manage to
set -Duser.language=en to the JVM, it might do the trick.

I summarized what I think I know about this at
https://issues.apache.org/jira/browse/SPARK-18076 and so we can decide what
to do, if anything, there.

Sean

On Mon, Oct 24, 2016 at 3:08 PM Pietro Pugni <pietro.pu...@gmail.com> wrote:

> Thank you, I’d appreciate that. I have no experience with Python, Java
> and Spark, so the question can be translated to: “How can I set the JVM
> locale when using spark-submit and pyspark?”. Probably this is possible
> only by changing the system default locale and not within the Spark session,
> right?
>
> Thank you
>  Pietro
>
> Il giorno 24 ott 2016, alle ore 14:51, Hyukjin Kwon <gurwls...@gmail.com>
> ha scritto:
>
> I am also interested in this issue. I will try to look into this too
> within coming few days..
>
> 2016-10-24 21:32 GMT+09:00 Sean Owen <so...@cloudera.com>:
>
> I actually think this is a general problem with usage of DateFormat and
> SimpleDateFormat across the code, in that it relies on the default locale
> of the JVM. I believe this needs to, at least, default consistently to
> Locale.US so that behavior is consistent; otherwise it's possible that
> parsing and formatting of dates could work subtly differently across
> environments.
>
> There's a similar question about some code that formats dates for the UI.
> It's more reasonable to let that use the platform-default locale, but, I'd
> still favor standardizing it I think.
>
> Anyway, let me test it out a bit and possibly open a JIRA with this change
> for discussion.
>
> On Mon, Oct 24, 2016 at 1:03 PM pietrop <pietro.pu...@gmail.com> wrote:
>
> Hi there,
> I opened a question on StackOverflow at this link:
>
> http://stackoverflow.com/questions/40007972/pyspark-doesnt-recognize-mmm-dateformat-pattern-in-spark-read-load-for-dates?noredirect=1#comment67297930_40007972
>
> I didn’t get any useful answer, so I’m writing here hoping that someone can
> help me.
>
> In short, I’m trying to read a CSV containing data columns stored using the
> pattern “MMMdd”. What doesn’t work for me is “MMM”. I’ve done some
> testing and discovered that it’s a localization issue. As you can read from
> the StackOverflow question, I run a simple Java code to parse the date
> “1989Dec31” and it works only if I specify Locale.US in the
> SimpleDateFormat() function.
>
> I would like pyspark to work. I tried setting a different local from
> console
> (LANG=“en_US”), but it doesn’t work. I tried also setting it using the
> locale package from Python.
>
> So, there’s a way to set locale in Spark when using pyspark? The issue is
> Java related and not Python related (the function that parses data is
> invoked by spark.read.load(dateFormat=“MMMdd”, …). I don’t want to use
> other solutions in order to encode data because they are slower (from what
> I’ve seen so far).
>
> Thank you
> Pietro
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-doesn-t-recognize-MMM-dateFormat-pattern-in-spark-read-load-for-dates-like-1989Dec31-and-31D9-tp27951.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
>


Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread Sean Owen
I actually think this is a general problem with usage of DateFormat and
SimpleDateFormat across the code, in that it relies on the default locale
of the JVM. I believe this needs to, at least, default consistently to
Locale.US so that behavior is consistent; otherwise it's possible that
parsing and formatting of dates could work subtly differently across
environments.

There's a similar question about some code that formats dates for the UI.
It's more reasonable to let that use the platform-default locale, but, I'd
still favor standardizing it I think.

Anyway, let me test it out a bit and possibly open a JIRA with this change
for discussion.
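
A small illustration of the locale dependence (the "yyyyMMMdd" pattern here is
an assumption, matching dates like 1989Dec31):

import java.text.SimpleDateFormat
import java.util.Locale

val pattern = "yyyyMMMdd"

// Explicit Locale.US: "Dec" is a valid month abbreviation, so this parses.
new SimpleDateFormat(pattern, Locale.US).parse("1989Dec31")

// Default-locale constructor: on, say, an Italian JVM the abbreviation is
// "dic", so the same input throws java.text.ParseException.
new SimpleDateFormat(pattern).parse("1989Dec31")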

On Mon, Oct 24, 2016 at 1:03 PM pietrop  wrote:

Hi there,
I opened a question on StackOverflow at this link:
http://stackoverflow.com/questions/40007972/pyspark-doesnt-recognize-mmm-dateformat-pattern-in-spark-read-load-for-dates?noredirect=1#comment67297930_40007972

I didn’t get any useful answer, so I’m writing here hoping that someone can
help me.

In short, I’m trying to read a CSV containing data columns stored using the
pattern “MMMdd”. What doesn’t work for me is “MMM”. I’ve done some
testing and discovered that it’s a localization issue. As you can read from
the StackOverflow question, I run a simple Java code to parse the date
“1989Dec31” and it works only if I specify Locale.US in the
SimpleDateFormat() function.

I would like pyspark to work. I tried setting a different local from console
(LANG=“en_US”), but it doesn’t work. I tried also setting it using the
locale package from Python.

So, there’s a way to set locale in Spark when using pyspark? The issue is
Java related and not Python related (the function that parses data is
invoked by spark.read.load(dateFormat=“MMMdd”, …). I don’t want to use
other solutions in order to encode data because they are slower (from what
I’ve seen so far).

Thank you
Pietro



--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-doesn-t-recognize-MMM-dateFormat-pattern-in-spark-read-load-for-dates-like-1989Dec31-and-31D9-tp27951.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: Spark Streaming 2 Kafka 0.10 Integration for Aggregating Data

2016-10-18 Thread Sean Owen
Try adding the spark-streaming_2.11 artifact as a dependency too. You will
be directly depending on it.

On Tue, Oct 18, 2016 at 2:16 PM Furkan KAMACI 
wrote:

> Hi,
>
> I have a search application and want to monitor queries per second for it.
> I have Kafka at my backend which acts like a bus for messages. Whenever a
> search request is done I publish the nano time of the current system. I
> want to use Spark Streaming to aggregate such data but I am so new to it.
>
> I wanted to follow that example:
> http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
>
> I've added that dependencies:
>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
>   <version>2.0.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-core_2.10</artifactId>
>   <version>2.0.1</version>
> </dependency>
>
> However, I cannot even see the Duration class in my dependencies. On the
> other hand, the linked documentation is incomplete: when you click the Java
> tab there is no code.
>
> Could you guide me how can I implement monitoring such a metric?
>
> Kind Regards,
> Furkan KAMACI
>


Re: Couchbase-Spark 2.0.0

2016-10-17 Thread Sean Owen
You're now asking about Couchbase code, so this isn't the best place to
ask. Head to the Couchbase forums.

On Mon, Oct 17, 2016 at 10:14 AM Devi P.V  wrote:

> Hi,
> I tried with the following code
>
> import com.couchbase.spark._
> val conf = new SparkConf()
>   .setAppName(this.getClass.getName)
>   .setMaster("local[*]")
>   .set("com.couchbase.bucket.bucketName","password")
>   .set("com.couchbase.nodes", "node")
> .set ("com.couchbase.queryEnabled", "true")
> val sc = new SparkContext(conf)
>
> I need the full documents from the bucket, so I gave a query like this:
> val query = "SELECT META(`bucketName`).id as id FROM `bucketName` "
>  sc
>   .couchbaseQuery(Query.simple(query))
>   .map(_.value.getString("id"))
>   .couchbaseGet[JsonDocument]()
>   .collect()
>   .foreach(println)
>
> But it can't take Query.simple(query)
>
> I used libraryDependencies += "com.couchbase.client" %
> "spark-connector_2.11" % "1.2.1" in build.sbt.
> Is my query wrong, or is there anything else I need to import?
>
>
> Please help.
>
> On Sun, Oct 16, 2016 at 8:23 PM, Rodrick Brown <
> rodr...@orchardplatform.com> wrote:
>
>
>
> On Sun, Oct 16, 2016 at 10:51 AM, Devi P.V  wrote:
>
> Hi all,
> I am trying to read data from Couchbase using Spark 2.0.0. I need to fetch
> the complete data from a bucket as an RDD. How can I solve this? Does Spark
> 2.0.0 support Couchbase? Please help.
>
> Thanks
>
> https://github.com/couchbase/couchbase-spark-connector
>
>
> --
>
> [image: Orchard Platform] 
>
> *Rodrick Brown */ *DevOPs*
>
> 9174456839 / rodr...@orchardplatform.com
>
> Orchard Platform
> 101 5th Avenue, 4th Floor, New York, NY
>
> *NOTICE TO RECIPIENTS*: This communication is confidential and intended
> for the use of the addressee only. If you are not an intended recipient of
> this communication, please delete it immediately and notify the sender by
> return email. Unauthorized reading, dissemination, distribution or copying
> of this communication is prohibited. This communication does not constitute
> an offer to sell or a solicitation of an indication of interest to purchase
> any loan, security or any other financial product or instrument, nor is it
> an offer to sell or a solicitation of an indication of interest to purchase
> any products or services to any persons who are prohibited from receiving
> such information under applicable law. The contents of this communication
> may not be accurate or complete and are subject to change without notice.
> As such, Orchard App, Inc. (including its subsidiaries and affiliates,
> "Orchard") makes no representation regarding the accuracy or completeness
> of the information contained herein. The intended recipient is advised to
> consult its own professional advisors, including those specializing in
> legal, tax and accounting matters. Orchard does not provide legal, tax or
> accounting advice.
>
>
>


Re: Question about the offiicial binary Spark 2 package

2016-10-17 Thread Sean Owen
You can take the "with user-provided Hadoop" binary from the download page,
and yes that should mean it does not drag in a Hive dependency of its own.

On Mon, Oct 17, 2016 at 7:08 AM Xi Shen  wrote:

> Hi,
>
> I want to configure my Hive to use Spark 2 as its engine. According to
> Hive's instructions, Spark should be built *without* Hadoop or Hive. I
> could build my own, but for some reason I hope I could use an official
> binary build.
>
> So I want to ask if the official Spark binary build labeled "with
> user-provided Hadoop" also implies "user-provided Hive".
>
> --
>
>
> Thanks,
> David S.
>


Re: Possible memory leak after closing spark context in v2.0.1

2016-10-17 Thread Sean Owen
Did you unpersist the broadcast objects?
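
A minimal sketch of doing that explicitly in a suite teardown (this assumes a
SparkSession named spark; the broadcast value is illustrative):

val bigLookupTable = (1 to 1000000).toArray
val bc = spark.sparkContext.broadcast(bigLookupTable)
try {
  // ... run the tests that use bc.value ...
} finally {
  bc.unpersist(blocking = true)   // drop cached copies on the executors
  bc.destroy()                    // release all state for this broadcast
  spark.stop()
}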

On Mon, Oct 17, 2016 at 10:02 AM lev  wrote:

> Hello,
>
> I'm in the process of migrating my application to Spark 2.0.1,
> and I think there are some memory leaks related to broadcast joins.
>
> The application has many unit tests,
> and each individual test suite passes, but when running all together, it
> fails with OOM errors.
>
> In the beginning of each suite I create a new Spark session with the session
> builder:
> val spark = sessionBuilder.getOrCreate()
> and at the end of each suite, I call the stop method:
> spark.stop()
>
> I added a profiler to the application, and it looks like broadcast objects
> are taking most of the memory:
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n27910/profiler.png
>
> Since each test suite passes when run by itself,
> I think that the broadcasts are leaking between the test suites.
>
> Any suggestions on how to resolve this?
>
> thanks
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Possible-memory-leak-after-closing-spark-context-in-v2-0-1-tp27910.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Resizing Image with Scrimage in Spark

2016-10-17 Thread Sean Owen
It pretty much means what it says. Objects you send across machines must be
serializable, and the object from the library is not.
You can write a wrapper object that is serializable and knows how to
serialize it. Or ask the library dev to consider making this object
serializable.
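
Alongside the wrapper idea, one common workaround is to keep the serializable
raw bytes and decode to an image object only inside the function that runs on
the executors, so the non-serializable object never has to cross machines. A
sketch, with the JDK's ImageIO merely standing in for the imaging library's own
decode call:

import java.io.ByteArrayInputStream
import javax.imageio.ImageIO
import org.apache.spark.sql.functions.udf

val probeWidth = udf { (bytes: Array[Byte]) =>
  // The image object is created and used only here, on the executor,
  // so it is never serialized.
  val img = ImageIO.read(new ByteArrayInputStream(bytes))
  // ... resize img with the imaging library and re-encode it to bytes ...
  img.getWidth  // placeholder result; a real version would return re-encoded bytes
}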

On Mon, Oct 17, 2016 at 8:04 AM Adline Dsilva 
wrote:

> Hi All,
>
> I have a Hive table which contains around 500 million photos (profile
> pictures of users) stored as hex strings, and the total size of the table is 5TB.
> I'm trying to build a solution where images can be retrieved in real time.
>
> Current solution: resize the images and index them along with the user
> profile into Solr. For resizing, I'm using a Scala library called scrimage
> 
>
> While running the UDF function I'm getting the below error.
> Serialization stack:
> - *object not serializable* (class: com.sksamuel.scrimage.Image,
> value: Image [width=767, height=1024, type=2])
> - field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: imgR,
> type: class com.sksamuel.scrimage.Image)
>
> Can anyone suggest method to overcome the above error.
>
> Regards,
> Adline
>
> --
> *DISCLAIMER:*
>
> This e-mail (including any attachments) is for the addressee(s) only and
> may be confidential, especially as regards personal data. If you are not
> the intended recipient, please note that any dealing, review, distribution,
> printing, copying or use of this e-mail is strictly prohibited. If you have
> received this email in error, please notify the sender immediately and
> delete the original message (including any attachments).
>
> MIMOS Berhad is a research and development institution under the purview
> of the Malaysian Ministry of Science, Technology and Innovation. Opinions,
> conclusions and other information in this e-mail that do not relate to the
> official business of MIMOS Berhad and/or its subsidiaries shall be
> understood as neither given nor endorsed by MIMOS Berhad and/or its
> subsidiaries and neither MIMOS Berhad nor its subsidiaries accepts
> responsibility for the same. All liability arising from or in connection
> with computer viruses and/or corrupted e-mails is excluded to the fullest
> extent permitted by law.
>


Re: 回复:Spark-submit Problems

2016-10-16 Thread Sean Owen
Is it just a typo in the email or are you missing a space after your
--master argument?


The logs here actually don't say much but "something went wrong". It seems
fairly low-level, like the gateway process failed or didn't start, rather
than a problem with the program. It's hard to say more unless you can dig
out any more logs, like from the worker, executor?

On Sun, Oct 16, 2016 at 4:24 AM Tobi Bosede  wrote:

Hi Mekal, thanks for wanting to help. I have attached the python script as
well as the different exceptions here. I have also pasted the cluster
exception below so I can highlight the relevant parts.


[abosede2@badboy ~]$ spark-submit --master spark://10.160.5.48:7077
trade_data_count.py
Ivy Default Cache set to: /home/abosede2/.ivy2/cache
The jars for the packages stored in: /home/abosede2/.ivy2/jars
:: loading settings :: url =
jar:file:/usr/local/spark-1.6.1/assembly/target/scala-2.11/spark-assembly-1.6.1-hre/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-csv_2.11;1.3.0 in central
found org.apache.commons#commons-csv;1.1 in central
found com.univocity#univocity-parsers;1.5.1 in central
:: resolution report :: resolve 160ms :: artifacts dl 7ms
:: modules in use:
com.databricks#spark-csv_2.11;1.3.0 from central in [default]
com.univocity#univocity-parsers;1.5.1 from central in [default]
org.apache.commons#commons-csv;1.1 from central in [default]
-
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
-
| default | 3 | 0 | 0 | 0 || 3 | 0 |
-
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 3 already retrieved (0kB/6ms)
[Stage 0:> (104 + 8) /235]16/10/15 19:42:37 ERROR
TaskScadboy.win.ad.jhu.edu : Remote RPC
client disassociated. Likely due to containers exceeding thresholds, or
netwoWARN messages.
16/10/15 19:42:37 ERROR SparkDeploySchedulerBackend: Application has been
killed. Reason: Master removed our a
[Stage 0:===> (104 + -28) / 235]Traceback (most recent
call la
File "/home/abosede2/trade_data_count.py", line 79, in 
print("Raw data is %d rows." % data.count())
File
"/usr/local/spark-1.6.1/python/lib/pyspark.zip/pyspark/sql/dataframe.py",
line 269, in count
File
"/usr/lib/python2.7/site-packages/py4j-0.9.2-py2.7.egg/py4j/java_gateway.py",
line 836, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/spark-1.6.1/python/lib/pyspark.zip/pyspark/sql/utils.py",
line 45, in deco
File
"/usr/lib/python2.7/site-packages/py4j-0.9.2-py2.7.egg/py4j/protocol.py",
line 310, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o6867.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Master
removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org

$apache$spark$scheduler$DAGScheduler$$failJobAndIndepend)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799
at scala.Option.foreach(Option.scala:257)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at

Re: OOM when running Spark SQL by PySpark on Java 8

2016-10-13 Thread Sean Owen
You can specify it; it just doesn't do anything but cause a warning in Java
8. It won't work in general to have such a tiny PermGen. If it's working it
means you're on Java 8 because it's ignored. You should set MaxPermSize if
anything, not PermSize. However the error indicates you are not using Java
8 everywhere on your cluster, and that's a potentially bigger problem.

On Thu, Oct 13, 2016 at 10:26 AM Shady Xu  wrote:

> Solved the problem by specifying the PermGen size when submitting the job
> (even to just a few MB).
>
> Seems Java 8 has removed the Permanent Generation space, thus
> corresponding JVM arguments are ignored.  But I can still
> use --driver-java-options "-XX:PermSize=80M -XX:MaxPermSize=100m" to
> specify them when submitting the Spark job, which is weird. I don't know
> whether it has anything to do with py4j as I am not familiar with it.
>
> 2016-10-13 17:00 GMT+08:00 Shady Xu :
>
> Hi,
>
> I have a problem when running Spark SQL by PySpark on Java 8. Below is the
> log.
>
>
> 16/10/13 16:46:40 INFO spark.SparkContext: Starting job: sql at 
> NativeMethodAccessorImpl.java:-2
> Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: 
> PermGen space
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:857)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1630)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> Exception in thread "shuffle-server-2" java.lang.OutOfMemoryError: PermGen 
> space
> Exception in thread "shuffle-server-4" java.lang.OutOfMemoryError: PermGen 
> space
> Exception in thread "threadDeathWatcher-2-1" java.lang.OutOfMemoryError: 
> PermGen space
>
>
> I tried to increase the driver memory and it didn't help. However, things are ok 
> when I run the same code after switching to Java 7. I also find it ok to run 
> the SparkPi example on Java 8. So I believe the problem lies with PySpark 
> rather than Spark core.
>
>
> I am using Spark 2.0.1 and run the program in YARN cluster mode. Anyone any 
> idea is appreciated.
>
>
>


Re: OOM when running Spark SQL by PySpark on Java 8

2016-10-13 Thread Sean Owen
The error doesn't say you're out of memory, but says you're out of PermGen.
If you see this, you aren't running Java 8 AFAIK, because 8 has no PermGen.
But if you're running Java 7, and you go investigate what this error means,
you'll find you need to increase PermGen. This is mentioned in the Spark
docs too, as you need to increase this when building on Java 7.

On Thu, Oct 13, 2016 at 10:00 AM Shady Xu  wrote:

> Hi,
>
> I have a problem when running Spark SQL by PySpark on Java 8. Below is the
> log.
>
>
> 16/10/13 16:46:40 INFO spark.SparkContext: Starting job: sql at 
> NativeMethodAccessorImpl.java:-2
> Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: 
> PermGen space
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:857)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1630)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> Exception in thread "shuffle-server-2" java.lang.OutOfMemoryError: PermGen 
> space
> Exception in thread "shuffle-server-4" java.lang.OutOfMemoryError: PermGen 
> space
> Exception in thread "threadDeathWatcher-2-1" java.lang.OutOfMemoryError: 
> PermGen space
>
>
> I tried to increase the driver memory and it didn't help. However, things are ok 
> when I run the same code after switching to Java 7. I also find it ok to run 
> the SparkPi example on Java 8. So I believe the problem lies with PySpark 
> rather than Spark core.
>
>
> I am using Spark 2.0.1 and run the program in YARN cluster mode. Anyone any 
> idea is appreciated.
>
>


Re: Want to test spark-sql-kafka but get unresolved dependency error

2016-10-13 Thread Sean Owen
I don't believe that's been released yet. It looks like it was merged into
branches about a week ago. You're looking at unreleased docs too - have a
look at http://spark.apache.org/docs/latest/ for the latest released docs.

On Thu, Oct 13, 2016 at 9:24 AM JayKay  wrote:

> I want to work with the Kafka integration for structured streaming. I use
> Spark version 2.0.0. and I start the spark-shell with:
>
> spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.0
>
> As described here:
>
> https://github.com/apache/spark/blob/master/docs/structured-streaming-kafka-integration.md
>
> But I get a unresolved dependency error ("unresolved dependency:
> org.apache.spark#spark-sql-kafka-0-10_2.11;2.0.0: not found"). So it seems
> not to be available via maven or spark-packages.
>
> How can I access this package? Or am I doing something wrong/missing?
>
> Thank you for your help.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Want-to-test-spark-sql-kafka-but-get-unresolved-dependency-error-tp27891.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Linear Regression Error

2016-10-12 Thread Sean Owen
See https://issues.apache.org/jira/browse/SPARK-17588
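
For context (my reading, consistent with dropping the last feature fixing it):
lat_plus_long is an exact linear combination of lat and long, so the columns of
the feature matrix A are linearly dependent and the normal-equations matrix is
singular, which is what makes the Cholesky routine report failure:

\text{lat\_plus\_long} = \text{lat} + \text{long}
\;\Longrightarrow\; A^{T}A \text{ is not positive definite}
\;\Longrightarrow\; \text{lapack.dppsv returns a nonzero code.}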

On Wed, Oct 12, 2016 at 9:07 PM Meeraj Kunnumpurath <
mee...@servicesymphony.com> wrote:

> If I drop the last feature on the third model, the error seems to go away.
>
> On Wed, Oct 12, 2016 at 11:52 PM, Meeraj Kunnumpurath <
> mee...@servicesymphony.com> wrote:
>
> Hello,
>
> I have some code trying to compare linear regression coefficients with
> three sets of features, as shown below. On the third one, I get an
> assertion error.
>
> This is the code,
>
> object MultipleRegression extends App {
>
>
>
>   val spark = SparkSession.builder().appName("Regression Model 
> Builder").master("local").getOrCreate()
>
>   import spark.implicits._
>
>   val training = build("kc_house_train_data.csv", "train", spark)
>   val test = build("kc_house_test_data.csv", "test", spark)
>
>   val lr = new LinearRegression()
>
>   val m1 = lr.fit(training.map(r => buildLp(r, "sqft_living", "bedrooms", 
> "bathrooms", "lat", "long")))
>   println(s"Coefficients: ${m1.coefficients}, Intercept: ${m1.intercept}")
>
>   val m2 = lr.fit(training.map(r => buildLp(r, "sqft_living", "bedrooms", 
> "bathrooms", "lat", "long", "bed_bath_rooms")))
>   println(s"Coefficients: ${m2.coefficients}, Intercept: ${m2.intercept}")
>
>   val m3 = lr.fit(training.map(r => buildLp(r, "sqft_living", "bedrooms", 
> "bathrooms", "lat", "long", "bed_bath_rooms", "bedrooms_squared", 
> "log_sqft_living", "lat_plus_long")))
>   println(s"Coefficients: ${m3.coefficients}, Intercept: ${m3.intercept}")
>
>
>   def build(path: String, view: String, spark: SparkSession) = {
>
> val toDouble = udf((x: String) => x.toDouble)
> val product = udf((x: Double, y: Double) => x * y)
> val sum = udf((x: Double, y: Double) => x + y)
> val log = udf((x: Double) => scala.math.log(x))
>
> spark.read.
>   option("header", "true").
>   csv(path).
>   withColumn("sqft_living", toDouble('sqft_living)).
>   withColumn("price", toDouble('price)).
>   withColumn("bedrooms", toDouble('bedrooms)).
>   withColumn("bathrooms", toDouble('bathrooms)).
>   withColumn("lat", toDouble('lat)).
>   withColumn("long", toDouble('long)).
>   withColumn("bedrooms_squared", product('bedrooms, 'bedrooms)).
>   withColumn("bed_bath_rooms", product('bedrooms, 'bathrooms)).
>   withColumn("lat_plus_long", sum('lat, 'long)).
>   withColumn("log_sqft_living", log('sqft_living))
>
>   }
>
>   def buildLp(r: Row, input: String*) = {
> var features = input.map(r.getAs[Double](_)).toArray
> new LabeledPoint(r.getAs[Double]("price"), Vectors.dense(features))
>   }
>
> }
>
>
> This is the error I get.
>
> Exception in thread "main" java.lang.AssertionError: assertion failed:
> lapack.dppsv returned 9.
> at scala.Predef$.assert(Predef.scala:170)
> at
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:40)
> at
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:140)
> at org.apache.spark.ml
> .regression.LinearRegression.train(LinearRegression.scala:180)
> at org.apache.spark.ml
> .regression.LinearRegression.train(LinearRegression.scala:70)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> at
> com.ss.ml.regression.MultipleRegression$.delayedEndpoint$com$ss$ml$regression$MultipleRegression$1(MultipleRegression.scala:36)
> at
> com.ss.ml.regression.MultipleRegression$delayedInit$body.apply(MultipleRegression.scala:12)
> at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
> at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
> at scala.App$$anonfun$main$1.apply(App.scala:76)
> at scala.App$$anonfun$main$1.apply(App.scala:76)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
> at scala.App$class.main(App.scala:76)
> at
> com.ss.ml.regression.MultipleRegression$.main(MultipleRegression.scala:12)
> at com.ss.ml.regression.MultipleRegression.main(MultipleRegression.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
>
>
> Does anyone know what is going wrong here?
>
> Many thanks
>
> --
> *Meeraj Kunnumpurath*
>
>
> Director and Executive Principal, Service Symphony Ltd
> 00 44 7702 693597 / 00 971 50 409 0169
> mee...@servicesymphony.com
>
>
>
>
> --
> *Meeraj Kunnumpurath*
>
>
> Director and Executive Principal, Service Symphony Ltd
> 00 44 7702 693597 / 00 971 50 409 0169
> mee...@servicesymphony.com
>


Re: mllib model in production web API

2016-10-11 Thread Sean Owen
I don't believe it will ever scale to spin up a whole distributed job to
serve one request. You can possibly look at the bits in mllib-local. You
might do well to export the model as something like PMML, either with Spark's
export or JPMML, and then load it into a web container and score it without
Spark (possibly also with JPMML or OpenScoring).
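
As a rough sketch of the Spark-side half of that (this uses the old mllib API,
whose models mix in PMMLExportable, e.g. KMeansModel; sc and the output path
are assumptions here):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Train offline and export once; the web app then only needs the PMML file and
// a JVM scoring library such as JPMML/OpenScoring, with no Spark on its classpath.
val data = sc.parallelize(Seq(Vectors.dense(1.0, 1.0), Vectors.dense(9.0, 9.0)))
val model = KMeans.train(data, k = 2, maxIterations = 10)
model.toPMML("/tmp/kmeans.pmml")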

On Tue, Oct 11, 2016, 17:53 Nicolas Long  wrote:

> Hi all,
>
> so I have a model which has been stored in S3. And I have a Scala webapp
> which for certain requests loads the model and transforms submitted data
> against it.
>
> I'm not sure how to run this quickly on a single instance though. At the
> moment Spark is being bundled up with the web app in an uberjar (sbt
> assembly).
>
> But the process is quite slow. I'm aiming for responses < 1 sec so that
> the webapp can respond quickly to requests. When I look at the Spark UI I see:
>
> Summary Metrics for 1 Completed Tasks
>
> Metric                    | Min   | 25th percentile | Median | 75th percentile | Max
> Duration                  | 94 ms | 94 ms           | 94 ms  | 94 ms           | 94 ms
> Scheduler Delay           | 0 ms  | 0 ms            | 0 ms   | 0 ms            | 0 ms
> Task Deserialization Time | 3 s   | 3 s             | 3 s    | 3 s             | 3 s
> GC Time                   | 2 s   | 2 s             | 2 s    | 2 s             | 2 s
> Result Serialization Time | 0 ms  | 0 ms            | 0 ms   | 0 ms            | 0 ms
> Getting Result Time       | 0 ms  | 0 ms            | 0 ms   | 0 ms            | 0 ms
> Peak Execution Memory     | 0.0 B | 0.0 B           | 0.0 B  | 0.0 B           | 0.0 B
>
> I don't really understand why deserialization and GC should take so long
> when the models are already loaded. Is this evidence I am doing something
> wrong? And where can I get a better understanding on how Spark works under
> the hood here, and how best to do a standalone/bundled jar deployment?
>
> Thanks!
>
> Nic
>


Re: Pls assist: Spark 2.0 build failure on Ubuntu 16.06

2016-10-01 Thread Sean Owen
"Compile failed via zinc server"

Try shutting down zinc. Something's funny about your compile server.
It's not required anyway.

On Sat, Oct 1, 2016 at 3:24 PM, Marco Mistroni  wrote:
> Hi guys
>  sorry to annoy you on this but I am getting nowhere. So far I have tried to
> build Spark 2.0 on my local laptop with no success, so I blamed my
> laptop's poor performance.
> So today i fired off an EC2 Ubuntu 16.06 Instance and installed the
> following (i copy paste commands here)
>
> ubuntu@ip-172-31-40-104:~/spark$ history
> sudo apt-get install -y python-software-properties
> sudo apt-get install -y software-properties-common
> sudo add-apt-repository -y ppa:webupd8team/java
> sudo apt-get install -y oracle-java8-installer
> sudo apt-get install -y git
> git clone git://github.com/apache/spark.git
> cd spark
>
> Then launched the following commands:
> First thi sone, with only yarn, as i dont need hadoop
>
>  ./build/mvn -Pyarn  -DskipTests clean package
>
> This failed. after kicking off same command with -X i got this
>
> ERROR] Failed to execute goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on
> project spark-core_2.11: Execution scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed ->
> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile
> (scala-compile-first) on project spark-core_2.11: Execution
> scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
> at
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
> at
> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
> at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
> at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
> at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
> at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> Caused by: org.apache.maven.plugin.PluginExecutionException: Execution
> scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
> at
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
> ... 20 more
> Caused by: Compile failed via zinc server
> at
> sbt_inc.SbtIncrementalCompiler.zincCompile(SbtIncrementalCompiler.java:136)
> at
> sbt_inc.SbtIncrementalCompiler.compile(SbtIncrementalCompiler.java:86)
> at
> scala_maven.ScalaCompilerSupport.incrementalCompile(ScalaCompilerSupport.java:303)
> at
> scala_maven.ScalaCompilerSupport.compile(ScalaCompilerSupport.java:119)
> at
> scala_maven.ScalaCompilerSupport.doExecute(ScalaCompilerSupport.java:99)
> at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:482)
> at
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
> ... 21 more
>
>
>
> Then i tried to use hadoop even though i dont need hadoop for my code
>
> ./build/mvn -Pyarn -Phadoop-2.4  -DskipTests clean package
>
> This failed again, at exactly the same point, with the same error
>
> Then i thought maybe i used an old version of hadoop, so i tried to use 2.7
>
>   ./build/mvn -Pyarn -Phadoop-2.7  -DskipTests clean 

Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-29 Thread Sean Owen
No, I think that's what dependencyManagement (or equivalent) is definitely for.

On Thu, Sep 29, 2016 at 5:37 AM, Olivier Girardot
 wrote:
> I know that the code itself would not be the same, but it would be useful to
> at least have the pom/build.sbt transitive dependencies different when
> fetching the artifact with a specific classifier, don't you think?
> For now I've overridden them myself using the dependency versions defined in
> the pom.xml of Spark.
> So it's not a blocker issue, it may be useful to document it, but a blog
> post would be sufficient I think.
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-28 Thread Sean Owen
I guess I'm claiming the artifacts wouldn't even be different in the first
place, because the Hadoop APIs that Spark uses are all the same across these
versions. Differing APIs would be the thing that makes you need multiple
versions of the artifact under multiple classifiers.

On Wed, Sep 28, 2016 at 1:16 PM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> ok, don't you think it could be published with just different classifiers
> hadoop-2.6
> hadoop-2.4
> hadoop-2.2 being the current default.
>
> So for now, I should just override spark 2.0.0's dependencies with the
> ones defined in the pom profile
>
>
>
> On Thu, Sep 22, 2016 11:17 AM, Sean Owen so...@cloudera.com wrote:
>
>> There can be just one published version of the Spark artifacts and they
>> have to depend on something, though in truth they'd be binary-compatible
>> with anything 2.2+. So you merely manage the dependency versions up to the
>> desired version in your <dependencyManagement>.
>>
>> On Thu, Sep 22, 2016 at 7:05 AM, Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>> Hi,
>> when we fetch Spark 2.0.0 as maven dependency then we automatically end
>> up with hadoop 2.2 as a transitive dependency, I know multiple profiles are
>> used to generate the different tar.gz bundles that we can download, Is
>> there by any chance publications of Spark 2.0.0 with different classifier
>> according to different versions of Hadoop available ?
>>
>> Thanks for your time !
>>
>> *Olivier Girardot*
>>
>>
>>
>
> *Olivier Girardot* | Associé
> o.girar...@lateral-thoughts.com
> +33 6 24 09 17 94
>


Re: Large-scale matrix inverse in Spark

2016-09-27 Thread Sean Owen
I don't recall any code in Spark that computes a matrix inverse. There is
code that solves linear systems Ax = b with a decomposition. For example
from looking at the code recently, I think the regression implementation
actually solves AtAx = Atb using a Cholesky decomposition. But A is n x k,
where n is large and k is smallish (the number of features), so AtA is k x k
and can be solved in memory with a library.
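
A rough sketch of that approach (not the actual Spark internals): assume
the rows of A arrive as an RDD of (features, label) pairs and that k is
small enough for a k x k matrix to fit in driver memory; Breeze does the
local solve at the end.

  import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.rdd.RDD

  def solveNormalEquations(rows: RDD[(Vector, Double)], k: Int): BDV[Double] = {
    val (ata, atb) = rows.treeAggregate((BDM.zeros[Double](k, k), BDV.zeros[Double](k)))(
      { case ((m, v), (a, b)) =>
          val x = BDV(a.toArray)
          (m + x * x.t, v + x * b)   // accumulate A^T A and A^T b
      },
      { case ((m1, v1), (m2, v2)) => (m1 + m2, v1 + v2) }
    )
    ata \ atb   // k x k solve in local memory (LAPACK under the hood)
  }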

On Tue, Sep 27, 2016 at 3:05 AM, Cooper  wrote:
> How is the problem of large-scale matrix inversion approached in Apache Spark?
>
> This linear algebra operation is obviously the very base of a lot of other
> algorithms (regression, classification, etc). However, I have not been able
> to find a Spark API on parallel implementation of matrix inversion. Can you
> please clarify approaching this operation on the Spark internals ?
>
> Here    is a paper on the parallelized matrix inversion in Spark, however I
> am trying to use an existing code instead of implementing one from scratch,
> if available.
>
>
>
> --
> View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Large-scale-matrix-inverse-in-Spark-tp27796.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>


Re: MLib Documentation Update Needed

2016-09-26 Thread Sean Owen
Yes I think that footnote could be a lot more prominent, or pulled up
right under the table.

I also think it would be fine to present the {0,1} formulation. It's
actually more recognizable, I think, for log-loss in that form. It's
probably less recognizable for hinge loss, but consistency is more
important. There's just an extra (2y-1) term, at worst.
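
For what it's worth, the substitution is small; with the per-instance
log-loss as written in the guide (margin w^T x):

  L(w; x, y)  = \log(1 + \exp(-y \, w^T x)),           y  \in \{-1, +1\}
  L(w; x, y') = \log(1 + \exp(-(2y' - 1) \, w^T x)),   y' \in \{0, 1\}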

The loss here is per instance, and implicitly summed over all
instances. I think that is probably not confusing for the reader; if
they're reading this at all to double-check just what formulation is
being used, I think they'd know that. But, it's worth a note.

The loss is summed in the case of log-loss, not multiplied (if that's
what you're saying).

Those are decent improvements, feel free to open a pull request / JIRA.


On Mon, Sep 26, 2016 at 6:22 AM, Tobi Bosede  wrote:
> The loss function here for logistic regression is confusing. It seems to
> imply that spark uses only -1 and 1 class labels. However it uses 0,1 as the
> very inconspicuous note quoted below (under Classification) says. We need to
> make this point more visible to avoid confusion.
>
> Better yet, we should replace the loss function listed with that for 0, 1 no
> matter how mathematically inconvenient, since that is what is actually
> implemented in Spark.
>
> More problematic, the loss function (even in this "convenient" form) is
> actually incorrect. This is because it is missing either a summation (sigma)
> in the log or product (pi) outside the log, as the loss for logistic is the
> log likelihood. So there are multiple problems with the documentation.
> Please advise on steps to fix for all version documentation or if there are
> already some in place.
>
> "Note that, in the mathematical formulation in this guide, a binary label
> y is denoted as either +1 (positive) or −1 (negative), which is convenient
> for the formulation. However, the negative label is represented by 0 in
> spark.mllib instead of −1, to be consistent with multiclass labeling."

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Sean Owen
I don't think I'd enable swap on a cluster. You'd rather processes
fail than grind everything to a halt. You'd buy more memory or
optimize memory before trading it for I/O.

On Thu, Sep 22, 2016 at 6:29 PM, Michael Segel
 wrote:
> Ok… gotcha… wasn’t sure that YARN just looked at the heap size allocation and 
> ignored the off heap.
>
> WRT over all OS memory… this would be one reason why I’d keep a decent amount 
> of swap around. (Maybe even putting it on a fast device like an .m2 or PCIe 
> flash drive….

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Sean Owen
It's looking at the whole process's memory usage, and doesn't care
whether the memory is used by the heap or not within the JVM. Of
course, allocating memory off-heap still counts against you at the OS
level.

On Thu, Sep 22, 2016 at 3:54 PM, Michael Segel
<msegel_had...@hotmail.com> wrote:
> Thanks for the response Sean.
>
> But how does YARN know about the off-heap memory usage?
> That’s the piece that I’m missing.
>
> Thx again,
>
> -Mike
>
>> On Sep 21, 2016, at 10:09 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> No, Xmx only controls the maximum size of on-heap allocated memory.
>> The JVM doesn't manage/limit off-heap (how could it? it doesn't know
>> when it can be released).
>>
>> The answer is that YARN will kill the process because it's using more
>> memory than it asked for. A JVM is always going to use a little
>> off-heap memory by itself, so setting a max heap size of 2GB means the
>> JVM process may use a bit more than 2GB of memory. With an off-heap
>> intensive app like Spark it can be a lot more.
>>
>> There's a built-in 10% overhead, so that if you ask for a 3GB executor
>> it will ask for 3.3GB from YARN. You can increase the overhead.
>>
>> On Wed, Sep 21, 2016 at 11:41 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>> All off-heap memory is still managed by the JVM process. If you limit the
>>> memory of this process then you limit the memory. I think the memory of the
>>> JVM process could be limited via the xms/xmx parameter of the JVM. This can
>>> be configured via spark options for yarn (be aware that they are different
>>> in cluster and client mode), but i recommend to use the spark options for
>>> the off heap maximum.
>>>
>>> https://spark.apache.org/docs/latest/running-on-yarn.html
>>>
>>>
>>> On 21 Sep 2016, at 22:02, Michael Segel <msegel_had...@hotmail.com> wrote:
>>>
>>> I’ve asked this question a couple of times from a friend who didn’t know
>>> the answer… so I thought I would try here.
>>>
>>>
>>> Suppose we launch a job on a cluster (YARN) and we have set up the
>>> containers to be 3GB in size.
>>>
>>>
>>> What does that 3GB represent?
>>>
>>> I mean what happens if we end up using 2-3GB of off heap storage via
>>> tungsten?
>>> What will Spark do?
>>> Will it try to honor the container’s limits and throw an exception or will
>>> it allow my job to grab that amount of memory and exceed YARN’s
>>> expectations since its off heap?
>>>
>>> Thx
>>>
>>> -Mike
>>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Open source Spark based projects

2016-09-22 Thread Sean Owen
https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
and maybe related ...
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

On Thu, Sep 22, 2016 at 11:15 AM, tahirhn  wrote:
> I am planning to write a thesis on certain aspects (i.e testing, performance
> optimisation, security) of Apache Spark. I need to study some projects that
> are based on Apache Spark and are available as open source.
>
> If you know any such project (open source Spark based project), Please share
> it here. Thanks
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Open-source-Spark-based-projects-tp27778.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-22 Thread Sean Owen
There can be just one published version of the Spark artifacts and they
have to depend on something, though in truth they'd be binary-compatible
with anything 2.2+. So you merely manage the dependency versions up to the
desired version in your <dependencyManagement>.

On Thu, Sep 22, 2016 at 7:05 AM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> Hi,
> when we fetch Spark 2.0.0 as maven dependency then we automatically end up
> with hadoop 2.2 as a transitive dependency, I know multiple profiles are
> used to generate the different tar.gz bundles that we can download, Is
> there by any chance publications of Spark 2.0.0 with different classifier
> according to different versions of Hadoop available ?
>
> Thanks for your time !
>
> *Olivier Girardot*
>


Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-21 Thread Sean Owen
No, Xmx only controls the maximum size of on-heap allocated memory.
The JVM doesn't manage/limit off-heap (how could it? it doesn't know
when it can be released).

The answer is that YARN will kill the process because it's using more
memory than it asked for. A JVM is always going to use a little
off-heap memory by itself, so setting a max heap size of 2GB means the
JVM process may use a bit more than 2GB of memory. With an off-heap
intensive app like Spark it can be a lot more.

There's a built-in 10% overhead, so that if you ask for a 3GB executor
it will ask for 3.3GB from YARN. You can increase the overhead.
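
Concretely, something along these lines (the sizes are only an
illustration, the overhead value is in MB, and your-app.jar is a
placeholder):

  spark-submit --executor-memory 3g \
    --conf spark.yarn.executor.memoryOverhead=1024 \
    your-app.jar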

On Wed, Sep 21, 2016 at 11:41 PM, Jörn Franke  wrote:
> All off-heap memory is still managed by the JVM process. If you limit the
> memory of this process then you limit the memory. I think the memory of the
> JVM process could be limited via the xms/xmx parameter of the JVM. This can
> be configured via spark options for yarn (be aware that they are different
> in cluster and client mode), but i recommend to use the spark options for
> the off heap maximum.
>
> https://spark.apache.org/docs/latest/running-on-yarn.html
>
>
> On 21 Sep 2016, at 22:02, Michael Segel  wrote:
>
> I’ve asked this question a couple of times from a friend who didn’t know
> the answer… so I thought I would try here.
>
>
> Suppose we launch a job on a cluster (YARN) and we have set up the
> containers to be 3GB in size.
>
>
> What does that 3GB represent?
>
> I mean what happens if we end up using 2-3GB of off heap storage via
> tungsten?
> What will Spark do?
> Will it try to honor the container’s limits and throw an exception or will
> it allow my job to grab that amount of memory and exceed YARN’s
> expectations since its off heap?
>
> Thx
>
> -Mike
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Israel Spark Meetup

2016-09-21 Thread Sean Owen
Done.

On Wed, Sep 21, 2016 at 5:53 AM, Romi Kuntsman  wrote:
> Hello,
> Please add a link in Spark Community page
> (https://spark.apache.org/community.html)
> To Israel Spark Meetup (https://www.meetup.com/israel-spark-users/)
> We're an active meetup group, unifying the local Spark user community, and
> having regular meetups.
> Thanks!
> Romi K.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: SPARK-10835 in 2.0

2016-09-20 Thread Sean Owen
You can probably just do an identity transformation on the column to
make its type a nullable String array -- ArrayType(StringType, true).
Of course, I'm not sure why Word2Vec must reject a non-nullable array type
when it can handle a nullable one, but the previous discussion
indicated that this had to do with how UDFs work too.
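
A sketch of that identity transformation (grams and word2Vec are the names
from your snippet; the UDF itself is the workaround, not anything built in):

  import org.apache.spark.sql.functions.{col, udf}

  // identity UDF; its declared output type is ArrayType(StringType, true)
  val asNullable = udf { xs: Seq[String] => xs }
  val fixed = grams.withColumn("wordsGrams", asNullable(col("wordsGrams")))
  val model = word2Vec.fit(fixed)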

On Tue, Sep 20, 2016 at 4:03 PM, janardhan shetty
<janardhan...@gmail.com> wrote:
> Hi Sean,
>
> Any suggestions for workaround as of now?
>
> On Sep 20, 2016 7:46 AM, "janardhan shetty" <janardhan...@gmail.com> wrote:
>>
>> Thanks Sean.
>>
>> On Sep 20, 2016 7:45 AM, "Sean Owen" <so...@cloudera.com> wrote:
>>>
>>> Ah, I think that this was supposed to be changed with SPARK-9062. Let
>>> me see about reopening 10835 and addressing it.
>>>
>>> On Tue, Sep 20, 2016 at 3:24 PM, janardhan shetty
>>> <janardhan...@gmail.com> wrote:
>>> > Is this a bug?
>>> >
>>> > On Sep 19, 2016 10:10 PM, "janardhan shetty" <janardhan...@gmail.com>
>>> > wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> I am hitting this issue.
>>> >> https://issues.apache.org/jira/browse/SPARK-10835.
>>> >>
>>> >> Issue seems to be resolved but resurfacing in 2.0 ML. Any workaround
>>> >> is
>>> >> appreciated ?
>>> >>
>>> >> Note:
>>> >> Pipeline has Ngram before word2Vec.
>>> >>
>>> >> Error:
>>> >> val word2Vec = new
>>> >>
>>> >> Word2Vec().setInputCol("wordsGrams").setOutputCol("features").setVectorSize(128).setMinCount(10)
>>> >>
>>> >> scala> word2Vec.fit(grams)
>>> >> java.lang.IllegalArgumentException: requirement failed: Column
>>> >> wordsGrams
>>> >> must be of type ArrayType(StringType,true) but was actually
>>> >> ArrayType(StringType,false).
>>> >>   at scala.Predef$.require(Predef.scala:224)
>>> >>   at
>>> >>
>>> >> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
>>> >>   at
>>> >>
>>> >> org.apache.spark.ml.feature.Word2VecBase$class.validateAndTransformSchema(Word2Vec.scala:111)
>>> >>   at
>>> >>
>>> >> org.apache.spark.ml.feature.Word2Vec.validateAndTransformSchema(Word2Vec.scala:121)
>>> >>   at
>>> >>
>>> >> org.apache.spark.ml.feature.Word2Vec.transformSchema(Word2Vec.scala:187)
>>> >>   at
>>> >> org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
>>> >>   at org.apache.spark.ml.feature.Word2Vec.fit(Word2Vec.scala:170)
>>> >>
>>> >>
>>> >> Github code for Ngram:
>>> >>
>>> >>
>>> >> override protected def validateInputType(inputType: DataType): Unit =
>>> >> {
>>> >> require(inputType.sameType(ArrayType(StringType)),
>>> >>   s"Input type must be ArrayType(StringType) but got $inputType.")
>>> >>   }
>>> >>
>>> >>   override protected def outputDataType: DataType = new
>>> >> ArrayType(StringType, false)
>>> >> }
>>> >>
>>> >

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: SPARK-10835 in 2.0

2016-09-20 Thread Sean Owen
Ah, I think that this was supposed to be changed with SPARK-9062. Let
me see about reopening 10835 and addressing it.

On Tue, Sep 20, 2016 at 3:24 PM, janardhan shetty
 wrote:
> Is this a bug?
>
> On Sep 19, 2016 10:10 PM, "janardhan shetty"  wrote:
>>
>> Hi,
>>
>> I am hitting this issue.
>> https://issues.apache.org/jira/browse/SPARK-10835.
>>
>> Issue seems to be resolved but resurfacing in 2.0 ML. Any workaround is
>> appreciated ?
>>
>> Note:
>> Pipeline has Ngram before word2Vec.
>>
>> Error:
>> val word2Vec = new
>> Word2Vec().setInputCol("wordsGrams").setOutputCol("features").setVectorSize(128).setMinCount(10)
>>
>> scala> word2Vec.fit(grams)
>> java.lang.IllegalArgumentException: requirement failed: Column wordsGrams
>> must be of type ArrayType(StringType,true) but was actually
>> ArrayType(StringType,false).
>>   at scala.Predef$.require(Predef.scala:224)
>>   at
>> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
>>   at
>> org.apache.spark.ml.feature.Word2VecBase$class.validateAndTransformSchema(Word2Vec.scala:111)
>>   at
>> org.apache.spark.ml.feature.Word2Vec.validateAndTransformSchema(Word2Vec.scala:121)
>>   at
>> org.apache.spark.ml.feature.Word2Vec.transformSchema(Word2Vec.scala:187)
>>   at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
>>   at org.apache.spark.ml.feature.Word2Vec.fit(Word2Vec.scala:170)
>>
>>
>> Github code for Ngram:
>>
>>
>> override protected def validateInputType(inputType: DataType): Unit = {
>> require(inputType.sameType(ArrayType(StringType)),
>>   s"Input type must be ArrayType(StringType) but got $inputType.")
>>   }
>>
>>   override protected def outputDataType: DataType = new
>> ArrayType(StringType, false)
>> }
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Java Compatibity Problems when we install rJava

2016-09-19 Thread Sean Owen
This isn't a Spark question, so I don't think this is the right place.

It shows that compilation of rJava failed for lack of some other
shared libraries (not Java-related). I think you'd have to get those
packages installed locally too.

If it ends up being Anaconda specific, you should try Continuum, or if
it looks CDH-related head to Cloudera support.

On Mon, Sep 19, 2016 at 8:29 PM, Arif,Mubaraka  wrote:
> We are trying to install rJava on suse Linux running Cloudera Hadoop CDH
> 5.7.2  with Spark 1.6.
>
> Anaconda 4.0 was installed using the CDH parcel.
>
>
>
> Have setup for Jupyter notebook but there are Java compability problems.
>
>
>
> For Java we are running :
>
>
>
> java version "1.8.0_51"
> Java(TM) SE Runtime Environment (build 1.8.0_51-tdc1-b16)
> Java HotSpot(TM) 64-Bit Server VM (build 25.51-b03, mixed mode)
>
>
>
> We followed the instructions from the blog :
> http://www.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.install.doc/doc/install_install_r.html
>
>
>
> After running we get the output : [as attached in file - rJava_err.txt]
>
>
>
> Any help is greatly appreciated.
>
>
>
> thanks,
>
> Muby
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: off heap to alluxio/tachyon in Spark 2

2016-09-19 Thread Sean Owen
It backed the "OFF_HEAP" storage level for RDDs. That's not quite the
same thing that off-heap Tungsten allocation refers to.

It's also worth pointing out that things like HDFS also can put data
into memory already.

On Mon, Sep 19, 2016 at 7:48 PM, Richard Catlin
 wrote:
> Here is my understanding.
>
> Spark used Tachyon as an off-heap solution for RDDs.  In certain situations,
> it would alleviate Garbage Collection or the RDDs.
>
> Tungsten, Spark 2’s off-heap (columnar format) is much more efficient and
> used as the default.  Alluvio no longer makes sense for this use.
>
>
> You can still use Tachyon/Alluxio to bring your files into Memory, which is
> quicker for Spark to access than your DFS(HDFS or S3).
>
> Alluxio actually supports a “Tiered Filesystem”, and automatically brings
> the “hotter” files into the fastest storage (Memory, SSD).  You can
> configure it with Memory, SSD, and/or HDDs with the DFS as the persistent
> store, called under-filesystem.
>
> Hope this helps.
>
> Richard Catlin
>
> On Sep 19, 2016, at 7:56 AM, aka.fe2s  wrote:
>
> Hi folks,
>
> What has happened with Tachyon / Alluxio in Spark 2? Doc doesn't mention it
> no longer.
>
> --
> Oleksiy Dyagilev
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is RankingMetrics' NDCG implementation correct?

2016-09-19 Thread Sean Owen
Yes, relevance is always 1. The label is not a relevance score, so I
don't think it's valid to use it as such.
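
For reference, the two common DCG conventions (a sketch, not a claim about
which is "correct"):

  DCG@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}           (linear gain)
  DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}   (exponential gain)

With binary relevance rel_i \in \{0, 1\}, which is effectively what
RankingMetrics assumes, the two coincide.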

On Mon, Sep 19, 2016 at 4:42 AM, Jong Wook Kim  wrote:
> Hi,
>
> I'm trying to evaluate a recommendation model, and found that Spark and
> Rival give different results, and it seems that Rival's one is what Kaggle
> defines: https://gist.github.com/jongwook/5d4e78290eaef22cb69abbf68b52e597
>
> Am I using RankingMetrics in a wrong way, or is Spark's implementation
> incorrect?
>
> To my knowledge, NDCG should be dependent on the relevance (or preference)
> values, but Spark's implementation seems not; it uses 1.0 where it should be
> 2^(relevance) - 1, probably assuming that relevance is all 1.0? I also tried
> tweaking, but its method to obtain the ideal DCG also seems wrong.
>
> Any feedback from MLlib developers would be appreciated. I made a
> modified/extended version of RankingMetrics that produces the identical
> numbers to Kaggle and Rival's results, and I'm wondering if it is something
> appropriate to be added back to MLlib.
>
> Jong Wook

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-18 Thread Sean Owen
Alluxio isn't a database though; it's storage. I may be still harping
on the wrong solution for you, but as we discussed offline, that's
also what Impala, Drill et al are for.

Sorry if this was mentioned before but Ignite is what GridGain became,
if that helps.

On Sat, Sep 17, 2016 at 11:00 PM, Mich Talebzadeh
 wrote:
> Thanks Todd
>
> As I thought Apache Ignite is a data fabric much like Oracle Coherence cache
> or HazelCast.
>
> The use case is different between an in-memory-database (IMDB) and Data
> Fabric. The build that I am dealing with has a 'database centric' view of
> its data (i.e. it accesses its data using Spark sql and JDBC) so an
> in-memory database will be a better fit. On the other hand If the
> application deals solely with Java objects and does not have any notion of a
> 'database', does not need SQL style queries and really just wants a
> distributed, high performance object storage grid, then I think Ignite would
> likely be the preferred choice.
>
> So will likely go if needed for an in-memory database like Alluxio. I have
> seen a rather debatable comparison between Spark and Ignite that looks to be
> like a one sided rant.
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: NoSuchField Error : INSTANCE specify user defined httpclient jar

2016-09-18 Thread Sean Owen
NoSuchFieldError in an HTTP client class?
This almost always means you have conflicting versions of an
unshaded dependency on your classpath, and in this case it could be
httpclient. You can often work around this with the userClassPathFirst
options for the driver and executor.
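
For example (a sketch; the httpclient jar name is the one from your
message, and the application jar is a placeholder):

  spark-submit \
    --conf spark.driver.userClassPathFirst=true \
    --conf spark.executor.userClassPathFirst=true \
    --jars httpclient-4.5.2.jar \
    your-app.jar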

On Sun, Sep 18, 2016 at 1:33 AM, sagarcasual .  wrote:
> Hello,
> I am using Spark 1.6.1 distribution over Cloudera CDH 5.7.0 cluster.
> When I am running my fatJar - spark jar and when it is making a call to
> HttpClient it is getting classic NoSuchField Error : INSTANCE. Which happens
> usually when httrpclient in classpath is older than anticipated httpclient
> jar version (should be 4.5.2).
> Can someone help me how do I overcome this, I tried multiple options
> 1. specifying --conf spark.files={path-to-httpclient-4.5.2.jar} and then
> using --conf spark.driver.extraClassPath=./httpclient-4.5.2.jar and --conf
> spark.executor.extraClassPath=./httpclient-4.5.2.jar
> 2. I specified the jar in --jar option
> 3. I specified jar in hdfs://master:8020/path/to/jar/httpclient.4.5.2.jar
> In each case I had error.
> I also would like to know if
> --conf spark.files={path-to-httpclient-4.5.2.jar},
> is path-to-httpclient-4.5.2.jar can be giving to some local path from where
> I am issuing spark-submit?
> And also are there any other suggestions how should I resolve this conflict?
>
> Any clues?
> -Regards
> Sagar

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How PolynomialExpansion works

2016-09-16 Thread Sean Owen
The result includes, essentially, all the terms in (x+y) and (x+y)^2,
and so on up if you choose a higher power. It is not just the
second-degree terms.
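
A quick way to see it (a minimal sketch, assuming a SparkSession named
spark and the 2.x ML API):

  import org.apache.spark.ml.feature.PolynomialExpansion
  import org.apache.spark.ml.linalg.Vectors

  val df = spark.createDataFrame(Seq(Tuple1(Vectors.dense(2.0, 3.0))))
    .toDF("features")
  val px = new PolynomialExpansion()
    .setInputCol("features")
    .setOutputCol("expanded")
    .setDegree(2)
  px.transform(df).select("expanded").show(false)
  // roughly: [2.0, 4.0, 3.0, 6.0, 9.0]  i.e. (x, x*x, y, x*y, y*y)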

On Fri, Sep 16, 2016 at 7:43 PM, Nirav Patel  wrote:
> Doc says:
>
> Take a 2-variable feature vector as an example: (x, y), if we want to expand
> it with degree 2, then we get (x, x * x, y, x * y, y * y).
>
> I know polynomial expansion of (x+y)^2 = x^2 + 2xy + y^2 but can't relate it
> to above.
>
> Thanks
>
>
>
>
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Issues while running MLlib matrix factorization ALS algorithm

2016-09-16 Thread Sean Owen
Oh, this is the Netflix dataset, right? I recognize it from the number
of users/items. It's not fast on a laptop or anything, and it takes
plenty of memory, but it succeeds. I haven't run this recently, but it
worked in Spark 1.x.

On Fri, Sep 16, 2016 at 5:13 PM, Roshani Nagmote
<roshaninagmo...@gmail.com> wrote:
> I am also surprised that I face this problems with fairy small dataset on 14
> M4.2xlarge machines.  Could you please let me know on which dataset you can
> run 100 iterations of rank 30 on your laptop?
>
> I am currently just trying to run the default example code given with spark
> to run ALS on movie lens dataset. I did not change anything in the code.
> However I am running this example on Netflix dataset (1.5 gb)
>
> Thanks,
> Roshani
>
>
> On Friday, September 16, 2016, Sean Owen <so...@cloudera.com> wrote:
>>
>> You may have to decrease the checkpoint interval to say 5 if you're
>> getting StackOverflowError. You may have a particularly deep lineage
>> being created during iterations.
>>
>> No space left on device means you don't have enough local disk to
>> accommodate the big shuffles in some stage. You can add more disk or
>> maybe look at tuning shuffle params to do more in memory and maybe
>> avoid spilling to disk as much.
>>
>> However, given the small data size, I'm surprised that you see either
>> problem.
>>
>> 10-20 iterations is usually where the model stops improving much anyway.
>>
>> I can run 100 iterations of rank 30 on my *laptop* so something is
>> fairly wrong in your setup or maybe in other parts of your user code.
>>
>> On Thu, Sep 15, 2016 at 10:00 PM, Roshani Nagmote
>> <roshaninagmo...@gmail.com> wrote:
>> > Hi,
>> >
>> > I need help to run matrix factorization ALS algorithm in Spark MLlib.
>> >
>> > I am using dataset(1.5Gb) having 480189 users and 17770 items formatted
>> > in
>> > similar way as Movielens dataset.
>> > I am trying to run MovieLensALS example jar on this dataset on AWS Spark
>> > EMR
>> > cluster having 14 M4.2xlarge slaves.
>> >
>> > Command run:
>> > /usr/lib/spark/bin/spark-submit --deploy-mode cluster --master yarn
>> > --class
>> > org.apache.spark.examples.mllib.MovieLensALS --jars
>> > /usr/lib/spark/examples/jars/scopt_2.11-3.3.0.jar
>> > /usr/lib/spark/examples/jars/spark-examples_2.11-2.0.0.jar --rank 32
>> > --numIterations 50 --kryo s3://dataset/input_dataset
>> >
>> > Issues I get:
>> > If I increase rank to 70 or more and numIterations 15 or more, I get
>> > following errors:
>> > 1) stack overflow error
>> > 2) No space left on device - shuffle phase
>> >
>> > Could you please let me know if there are any parameters I should tune
>> > to
>> > make this algorithm work on this dataset?
>> >
>> > For better rmse, I want to increase iterations. Am I missing something
>> > very
>> > trivial? Could anyone help me run this algorithm on this specific
>> > dataset
>> > with more iterations?
>> >
>> > Was anyone able to run ALS on spark with more than 100 iterations and
>> > rank
>> > more than 30?
>> >
>> > Any help will be greatly appreciated.
>> >
>> > Thanks and Regards,
>> > Roshani

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Issues while running MLlib matrix factorization ALS algorithm

2016-09-16 Thread Sean Owen
You may have to decrease the checkpoint interval to say 5 if you're
getting StackOverflowError. You may have a particularly deep lineage
being created during iterations.
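
If you're calling ALS yourself rather than through the bundled example,
that looks roughly like this (the checkpoint path is a placeholder, and
ratings is your RDD[Rating]):

  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")  // needed for checkpointing
  val model = new ALS()
    .setRank(32)
    .setIterations(50)
    .setCheckpointInterval(5)  // cut the lineage every 5 iterations
    .run(ratings)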

No space left on device means you don't have enough local disk to
accommodate the big shuffles in some stage. You can add more disk or
maybe look at tuning shuffle params to do more in memory and maybe
avoid spilling to disk as much.

However, given the small data size, I'm surprised that you see either problem.

10-20 iterations is usually where the model stops improving much anyway.

I can run 100 iterations of rank 30 on my *laptop* so something is
fairly wrong in your setup or maybe in other parts of your user code.

On Thu, Sep 15, 2016 at 10:00 PM, Roshani Nagmote
 wrote:
> Hi,
>
> I need help to run matrix factorization ALS algorithm in Spark MLlib.
>
> I am using dataset(1.5Gb) having 480189 users and 17770 items formatted in
> similar way as Movielens dataset.
> I am trying to run MovieLensALS example jar on this dataset on AWS Spark EMR
> cluster having 14 M4.2xlarge slaves.
>
> Command run:
> /usr/lib/spark/bin/spark-submit --deploy-mode cluster --master yarn --class
> org.apache.spark.examples.mllib.MovieLensALS --jars
> /usr/lib/spark/examples/jars/scopt_2.11-3.3.0.jar
> /usr/lib/spark/examples/jars/spark-examples_2.11-2.0.0.jar --rank 32
> --numIterations 50 --kryo s3://dataset/input_dataset
>
> Issues I get:
> If I increase rank to 70 or more and numIterations 15 or more, I get
> following errors:
> 1) stack overflow error
> 2) No space left on device - shuffle phase
>
> Could you please let me know if there are any parameters I should tune to
> make this algorithm work on this dataset?
>
> For better rmse, I want to increase iterations. Am I missing something very
> trivial? Could anyone help me run this algorithm on this specific dataset
> with more iterations?
>
> Was anyone able to run ALS on spark with more than 100 iterations and rank
> more than 30?
>
> Any help will be greatly appreciated.
>
> Thanks and Regards,
> Roshani

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: countApprox

2016-09-16 Thread Sean Owen
countApprox gives the best answer it can within some timeout. Is it
possible that 1 ms is more than enough to count this exactly? Then the
confidence wouldn't matter. Although that seems way too fast, you're
counting ranges whose values don't actually matter, and maybe the
Python side is smart enough to use that fact. Then counting a
partition takes almost no time. Does it return immediately?

On Thu, Sep 15, 2016 at 6:20 PM, Stefano Lodi  wrote:
> I am experimenting with countApprox. I created a RDD of 10^8 numbers and ran
> countApprox with different parameters but I failed to generate any
> approximate output. In all runs it returns the exact number of elements.
> What is the effect of approximation in countApprox supposed to be, and for
> what inputs and parameters?
>
> >>> rdd = sc.parallelize([random.choice(range(1000)) for i in range(10**8)],
> 50)
> >>> rdd.countApprox(1, 0.8)
> [Stage 12:>(0 + 0) /
> 50]16/09/15 15:45:28 WARN TaskSetManager: Stage 12 contains a task of very
> large size (5402 KB). The maximum recommended task size is 100 KB.
> [Stage 12:==> (49 + 1) /
> 50]1
> >>> rdd.countApprox(1, 0.01)
> 16/09/15 15:45:45 WARN TaskSetManager: Stage 13 contains a task of very
> large size (5402 KB). The maximum recommended task size is 100 KB.
> [Stage 13:>   (47 + 3) /
> 50]1
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Best way to present data collected by Flume through Spark

2016-09-16 Thread Sean Owen
Why Hive, and why precompute data at 15-minute latency? There are
several ways here to query the source data directly with no extra step
or latency. Even Spark SQL is real-time-ish for queries on the
source data, and Impala (or, heck, Drill etc.) are.

On Thu, Sep 15, 2016 at 10:56 PM, Mich Talebzadeh
 wrote:
> OK this seems to be working for the "Batch layer". I will try to create a
> functional diagram for it
>
> Publisher sends prices every two seconds
> Kafka receives data
> Flume delivers data from Kafka to HDFS on text files time stamped
> A Hive ORC external table (source table) is created on the directory where
> flume writes continuously
> All temporary flume tables are prefixed by "." (hidden files), so Hive
> external table does not see those
> Every price row includes a timestamp
> A conventional Hive table (target table) is created with all columns from
> the external table + two additional columns with one being a timestamp from
> Hive
> A cron job set up that runs ever 15 minutes  as below
> 0,15,30,45 00-23 * * 1-5 (/home/hduser/dba/bin/populate_marketData.ksh -D
> test > /var/tmp/populate_marketData_test.err 2>&1)
>
> This cron as can be seen runs runs every 15 minutes and refreshes the Hive
> target table with the new data. New data meaning the price created time >
> MAX(price created time) from the target table
>
> Target table statistics are updated at each run. It takes an average of 2
> minutes to run the job
> Thu Sep 15 22:45:01 BST 2016  === Started
> /home/hduser/dba/bin/populate_marketData.ksh  ===
> 15/09/2016 22:45:09.09
> 15/09/2016 22:46:57.57
> 2016-09-15T22:46:10
> 2016-09-15T22:46:57
> Thu Sep 15 22:47:21 BST 2016  === Completed
> /home/hduser/dba/bin/populate_marketData.ksh  ===
>
>
> So the target table is 15 minutes out of sync with flume data which is not
> bad.
>
> Assuming that I replace ORC tables with Parquet, druid whatever, that can be
> done pretty easily. However, although I am using Zeppelin here, people may
> decide to use Tableau, QlikView etc which we need to think about the
> connectivity between these notebooks and the underlying database. I know
> Tableau and it is very SQL centric and works with ODBC and JDBC drivers or
> native drivers. For example I know that Tableau comes with Hive supplied
> ODBC drivers. I am not sure these database have drivers for Druid etc?
>
> Let me know your thoughts.
>
> Cheers
>
> Dr Mich Talebzadeh
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Best way to present data collected by Flume through Spark

2016-09-15 Thread Sean Owen
If your core requirement is ad-hoc real-time queries over the data,
then the standard Hadoop-centric answer would be:

Ingest via Kafka,
maybe using Flume, or possibly Spark Streaming, to read and land the data, in...
Parquet on HDFS or possibly Kudu, and
Impala to query

>> On 15 September 2016 at 09:35, Mich Talebzadeh 
>> wrote:
>>>
>>> Hi,
>>>
>>> This is for fishing for some ideas.
>>>
>>> In the design we get prices directly through Kafka into Flume and store
>>> it on HDFS as text files
>>> We can then use Spark with Zeppelin to present data to the users.
>>>
>>> This works. However, I am aware that once the volume of flat files rises
>>> one needs to do housekeeping. You don't want to read all files every time.
>>>
>>> A more viable alternative would be to read data into some form of tables
>>> (Hive etc) periodically through an hourly cron set up so batch process will
>>> have up to date and accurate data up to last hour.
>>>
>>> That certainly be an easier option for the users as well.
>>>
>>> I was wondering what would be the best strategy here. Druid, Hive others?
>>>
>>> The business case here is that users may want to access older data so a
>>> database of some sort will be a better solution? In all likelihood they want
>>> a week's data.
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>>> loss, damage or destruction of data or any other property which may arise
>>> from relying on this email's technical content is explicitly disclaimed. The
>>> author will in no case be liable for any monetary damages arising from such
>>> loss, damage or destruction.
>>>
>>>
>>
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Please assist: migrating RandomForestExample from MLLib to ML

2016-09-14 Thread Sean Owen
If it helps, I've already updated that code for the 2nd edition, which
will be based on ~Spark 2.1:

https://github.com/sryza/aas/blob/master/ch04-rdf/src/main/scala/com/cloudera/datascience/rdf/RunRDF.scala#L220

This should be an equivalent working example that deals with
categoricals via VectorIndexer.

You're right that you must use it because it adds the metadata that
says it's categorical. I'm not sure of another way to do it?
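
Roughly, building on the snippet below (the maxCategories value is a guess;
it just needs to be at least the cardinality of your largest categorical
column):

  import org.apache.spark.ml.feature.VectorIndexer

  val indexer = new VectorIndexer()
    .setInputCol("features")        // output of your VectorAssembler
    .setOutputCol("indexedFeatures")
    .setMaxCategories(40)

  // then: rf.setFeaturesCol("indexedFeatures") and
  // new Pipeline().setStages(Array(labelIndexer, features, indexer, rf, labelConverter))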

Sean


On Wed, Sep 14, 2016 at 10:18 PM, Marco Mistroni  wrote:
> hi all
>  i have been toying around with this well known RandomForestExample code
>
> val forest = RandomForest.trainClassifier(
>   trainData, 7, Map(10 -> 4, 11 -> 40), 20,
>   "auto", "entropy", 30, 300)
>
> This comes from this link
> (https://www.safaribooksonline.com/library/view/advanced-analytics-with/9781491912751/ch04.html),
> and also Sean Owen's presentation
>
> (https://www.youtube.com/watch?v=ObiCMJ24ezs)
>
>
>
> and now i want to migrate it to use ML Libraries.
> The problem i have is that the MLLib  example has categorical features, and
> i cannot find
> a way to use categorical features with ML
> Apparently i should use VectorIndexer, but VectorIndexer assumes only one
> input
> column for features.
> I am at the moment using Vectorassembler instead, but i cannot find a way to
> achieve the
> same
> I have checed spark samples, but all i can see is RandomForestClassifier
> using VectorIndexer for 1 feature
>
>
>
> Could anyone assist?
> This is my current codewhat do i need to add to take into account
> categorical features?
>
> val labelIndexer = new StringIndexer()
>   .setInputCol("Col0")
>   .setOutputCol("indexedLabel")
>   .fit(data)
>
> val features = new VectorAssembler()
>   .setInputCols(Array(
> "Col1", "Col2", "Col3", "Col4", "Col5",
> "Col6", "Col7", "Col8", "Col9", "Col10"))
>   .setOutputCol("features")
>
> val labelConverter = new IndexToString()
>   .setInputCol("prediction")
>   .setOutputCol("predictedLabel")
>   .setLabels(labelIndexer.labels)
>
> val rf = new RandomForestClassifier()
>   .setLabelCol("indexedLabel")
>   .setFeaturesCol("features")
>   .setNumTrees(20)
>   .setMaxDepth(30)
>   .setMaxBins(300)
>   .setImpurity("entropy")
>
> println("Kicking off pipeline..")
>
> val pipeline = new Pipeline()
>   .setStages(Array(labelIndexer, features, rf, labelConverter))
>
> thanks in advance and regards
>  Marco
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: RMSE in ALS

2016-09-14 Thread Sean Owen
Yes, that's what TF-IDF is, but it's just a statistic and not a
ranking. If you're using that to fill in a user-item matrix then that
is your model; you don't need ALS. Building an ALS model on this is
kind of like building a model on a model. Applying RMSE in this case
is a little funny, given the distribution of TF-IDF values. It's hard
to say what's normal but you're saying the test error is both 2.3 and
32.5. Regardless of which is really the test error it indicates
something is wrong with the modeling process. These ought not be too
different.

On Wed, Sep 14, 2016 at 9:22 PM, Pasquinell Urbani
<pasquinell.urb...@exalitica.com> wrote:
> The implicit rankings are the output of TF-IDF. I.e.:
> each ranking = frequency of an item * log(total number of customers /
> number of customers buying the item)
>
>
> On 14 Sept 2016 at 17:14, "Sean Owen" <so...@cloudera.com> wrote:
>>
>> What are implicit rankings here?
>> RMSE would not be an appropriate measure for comparing rankings. There are
>> ranking metrics like mean average precision that would be appropriate
>> instead.
>>
>> On Wed, Sep 14, 2016 at 9:11 PM, Pasquinell Urbani
>> <pasquinell.urb...@exalitica.com> wrote:
>>>
>>> It was a typo mistake, both are rmse.
>>>
>>> The frecency distribution of rankings is the following
>>>
>>>
>>>
>>> As you can see, I have heavy tail, but the majority of the observations
>>> rely near ranking  5.
>>>
>>> I'm working with implicit rankings (generated by TF-IDF), can this affect
>>> the error? (I'm currently using trainImplicit in ALS, spark 1.6.2)
>>>
>>> Thank you.
>>>
>>>
>>>
>>> 2016-09-14 16:49 GMT-03:00 Sean Owen <so...@cloudera.com>:
>>>>
>>>> There is no way to answer this without knowing what your inputs are
>>>> like. If they're on the scale of thousands, that's small (good). If
>>>> they're on the scale of 1-5, that's extremely poor.
>>>>
>>>> What's RMS vs RMSE?
>>>>
>>>> On Wed, Sep 14, 2016 at 8:33 PM, Pasquinell Urbani
>>>> <pasquinell.urb...@exalitica.com> wrote:
>>>> > Hi Community
>>>> >
>>>> > I'm performing an ALS for retail product recommendation. Right now I'm
>>>> > reaching rms_test = 2.3 and rmse_test = 32.5. Is this too much in your
>>>> > experience? Does the transformation of the ranking values important
>>>> > for
>>>> > having good errors?
>>>> >
>>>> > Thank you all.
>>>> >
>>>> > Pasquinell Urbani
>>>
>>>
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: RMSE in ALS

2016-09-14 Thread Sean Owen
What are implicit rankings here?
RMSE would not be an appropriate measure for comparing rankings. There are
ranking metrics like mean average precision that would be appropriate
instead.
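
A sketch with the built-in RankingMetrics (predictionAndLabels is a
placeholder: one element per user, pairing the top-N recommended item ids
with the ids the user actually interacted with):

  import org.apache.spark.mllib.evaluation.RankingMetrics

  // predictionAndLabels: RDD[(Array[Int], Array[Int])]
  val metrics = new RankingMetrics(predictionAndLabels)
  println(s"MAP = ${metrics.meanAveragePrecision}")
  println(s"precision@10 = ${metrics.precisionAt(10)}")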

On Wed, Sep 14, 2016 at 9:11 PM, Pasquinell Urbani <
pasquinell.urb...@exalitica.com> wrote:

> It was a typo mistake, both are rmse.
>
> The frequency distribution of rankings is the following
>
> [image: embedded image of the ranking frequency distribution]
>
> As you can see, I have heavy tail, but the majority of the observations
> rely near ranking  5.
>
> I'm working with implicit rankings (generated by TF-IDF), can this affect
> the error? (I'm currently using trainImplicit in ALS, spark 1.6.2)
>
> Thank you.
>
>
>
> 2016-09-14 16:49 GMT-03:00 Sean Owen <so...@cloudera.com>:
>
>> There is no way to answer this without knowing what your inputs are
>> like. If they're on the scale of thousands, that's small (good). If
>> they're on the scale of 1-5, that's extremely poor.
>>
>> What's RMS vs RMSE?
>>
>> On Wed, Sep 14, 2016 at 8:33 PM, Pasquinell Urbani
>> <pasquinell.urb...@exalitica.com> wrote:
>> > Hi Community
>> >
>> > I'm performing an ALS for retail product recommendation. Right now I'm
>> > reaching rms_test = 2.3 and rmse_test = 32.5. Is this too much in your
>> > experience? Does the transformation of the ranking values important for
>> > having good errors?
>> >
>> > Thank you all.
>> >
>> > Pasquinell Urbani
>>
>
>


Re: RMSE in ALS

2016-09-14 Thread Sean Owen
There is no way to answer this without knowing what your inputs are
like. If they're on the scale of thousands, that's small (good). If
they're on the scale of 1-5, that's extremely poor.

What's RMS vs RMSE?

On Wed, Sep 14, 2016 at 8:33 PM, Pasquinell Urbani
 wrote:
> Hi Community
>
> I'm performing an ALS for retail product recommendation. Right now I'm
> reaching rms_test = 2.3 and rmse_test = 32.5. Is this too much in your
> experience? Does the transformation of the ranking values important for
> having good errors?
>
> Thank you all.
>
> Pasquinell Urbani

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread Sean Owen
Given the nature of your metric, I don't think you can use things like
LSH which more or less depend on a continuous metric space. This is
too specific to fit into a general framework usefully, I think, but
you can solve this directly with some code without much trouble.
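
For example, something along these lines with DataFrames (a rough sketch
against your sample data, assuming Spark 2.x with the table loaded as a
DataFrame df; it ignores ties and assumes every (State, City) group has at
least one Flag=1 row):

  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.expressions.Window
  import spark.implicits._

  val flag0 = df.filter($"Flag" === 0)
  val flag1 = df.filter($"Flag" === 1)
    .select($"State", $"City", $"ID".as("ID1"),
            $"Price".as("Price1"), $"Number".as("Number1"))

  val pairs = flag0.join(flag1, Seq("State", "City"))
    .withColumn("dist",
      abs($"Price1" / $"Price" - 1.0) + abs($"Number1" / $"Number" - 1.0))

  val w = Window.partitionBy($"ID").orderBy($"dist")
  val nearest = pairs
    .withColumn("rn", row_number().over(w))
    .filter($"rn" === 1)
    .select($"ID".as("ID0"), $"ID1")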

On Tue, Sep 13, 2016 at 8:45 PM, Mobius ReX <aoi...@gmail.com> wrote:
> Hi Sean,
>
> Great!
>
> Is there any sample code implementing Locality Sensitive Hashing with Spark,
> in either scala or python?
>
> "However if your rule is really like "must match column A and B and
> then closest value in column C then just ordering everything by A, B,
> C lets you pretty much read off the answer from the result set
> directly. Everything is closest to one of its two neighbors."
>
> This is interesting since we can use Lead/Lag Windowing function if we have
> only one continuous column. However,
> our rule is "must match column A and B and then closest values in column C
> and D - for any ID with column E = 0, and the closest ID with Column E = 1".
> The distance metric between ID1 (with Column E =0) and ID2 (with Column E
> =1) is defined as
> abs( C1/C1 - C2/C1 ) + abs (D1/D1 - D2/D1)
> One cannot do
> abs( (C1/C1 + D1/D1) - (C2/C1 + D2/ D1) )
>
>
> Any further tips?
>
> Best,
> Rex
>
>
>
> On Tue, Sep 13, 2016 at 11:09 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> The key is really to specify the distance metric that defines
>> "closeness" for you. You have features that aren't on the same scale,
>> and some that aren't continuous. You might look to clustering for
>> ideas here, though mostly you just want to normalize the scale of
>> dimensions to make them comparable.
>>
>> You can find nearest neighbors by brute force. If speed really matters
>> you can consider locality sensitive hashing, which isn't that hard to
>> implement and can give a lot of speed for a small cost in accuracy.
>>
>> However if your rule is really like "must match column A and B and
>> then closest value in column C then just ordering everything by A, B,
>> C lets you pretty much read off the answer from the result set
>> directly. Everything is closest to one of its two neighbors.
>>
>> On Tue, Sep 13, 2016 at 6:18 PM, Mobius ReX <aoi...@gmail.com> wrote:
>> > Given a table
>> >
>> >> $cat data.csv
>> >>
>> >> ID,State,City,Price,Number,Flag
>> >> 1,CA,A,100,1000,0
>> >> 2,CA,A,96,1010,1
>> >> 3,CA,A,195,1010,1
>> >> 4,NY,B,124,2000,0
>> >> 5,NY,B,128,2001,1
>> >> 6,NY,C,24,3,0
>> >> 7,NY,C,27,30100,1
>> >> 8,NY,C,29,30200,0
>> >> 9,NY,C,39,33000,1
>> >
>> >
>> > Expected Result:
>> >
>> > ID0, ID1
>> > 1,2
>> > 4,5
>> > 6,7
>> > 8,7
>> >
>> > for each ID with Flag=0 above, we want to find another ID from Flag=1,
>> > with
>> > the same "State" and "City", and the nearest Price and Number normalized
>> > by
>> > the corresponding values of that ID with Flag=0.
>> >
>> > For example, ID = 1 and ID=2, has the same State and City, but different
>> > FLAG.
>> > After normalized the Price and Number (Price divided by 100, Number
>> > divided
>> > by 1000), the distance between ID=1 and ID=2 is defined as :
>> > abs(100/100 - 96/100) + abs(1000/1000 - 1010/1000) = 0.04 + 0.01 = 0.05
>> >
>> >
>> > What's the best way to find such nearest neighbor? Any valuable tips
>> > will be
>> > greatly appreciated!
>> >
>> >
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Character encoding corruption in Spark JDBC connector

2016-09-13 Thread Sean Owen
Based on your description, this isn't a problem in Spark. It means
your JDBC connector isn't interpreting bytes from the database
according to the encoding in which they were written. It could be
Latin1, sure.

But if "new String(ResultSet.getBytes())" works, it's only because
your platform's default JVM encoding is Latin1 too. Really you need to
specify the encoding directly in that constructor, or else this will
not in general work on other platforms, no.
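
That is, something like this (ISO-8859-1 here stands in for MySQL's latin1;
the variable names are placeholders):

  import java.nio.charset.StandardCharsets

  val bytes = resultSet.getBytes(columnIndex)
  val value = new String(bytes, StandardCharsets.ISO_8859_1)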

That's not the solution though; ideally you find the setting that lets
the JDBC connector read the data as intended.

On Tue, Sep 13, 2016 at 8:02 PM, Mark Bittmann  wrote:
> Hello Spark community,
>
> I'm reading from a MySQL database into a Spark dataframe using the JDBC
> connector functionality, and I'm experiencing some character encoding
> issues. The default encoding for MySQL strings is latin1, but the mysql JDBC
> connector implementation of "ResultSet.getString()" will return an mangled
> unicode encoding of the data for certain characters such as the "all rights
> reserved" char. Instead, you can use "new String(ResultSet.getBytes())"
> which will return the correctly encoded string. I've confirmed this behavior
> with the mysql connector classes (i.e., without using the Spark wrapper).
>
> I can see here that the Spark JDBC connector uses getString(), though there
> is a note to move to getBytes() for performance reasons:
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L389
>
> For some special chars, I can reverse the behavior with a UDF that applies
> new String(badString.getBytes("Cp1252") , "UTF-8"), however for some foreign
> characters the underlying byte array is irreversibly changed and the data is
> corrupted.
>
> I can submit an issue/PR to fix it going forward if "new
> String(ResultSet.getBytes())" is the correct approach.
>
> Meanwhile, can anyone offer any recommendations on how to correct this
> behavior prior to it getting to a dataframe? I've tried every permutation of
> the settings in the JDBC connection url (characterSetResults,
> characterEncoding).
>
> I'm on Spark 1.6.
>
> Thanks!

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Sean Owen
You would generally use --conf to set this on the command line if using the
shell.
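
For example, something like this (a sketch; es.nodes and es.port are
elasticsearch-hadoop settings, and I believe the spark. prefix is what lets
the shell forward them -- check the connector docs):

  bin/spark-shell --conf spark.es.nodes=your-es-host --conf spark.es.port=9200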

On Tue, Sep 13, 2016, 19:22 Kevin Burton <bur...@spinn3r.com> wrote:

> The problem is that without a new spark context, with a custom conf,
> elasticsearch-hadoop is refusing to read in settings about the ES setup...
>
> if I do a sc.stop() , then create a new one, it seems to work fine.
>
> But it isn't really documented anywhere and all the existing documentation
> is now invalid because you get an exception when you try to create a new
> spark context.
>
> On Tue, Sep 13, 2016 at 11:13 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> I think this works in a shell but you need to allow multiple spark
>> contexts
>>
>> Spark context Web UI available at http://50.140.197.217:5
>> Spark context available as 'sc' (master = local, app id =
>> local-1473789661846).
>> Spark session available as 'spark'.
>> Welcome to
>>     __
>>  / __/__  ___ _/ /__
>> _\ \/ _ \/ _ `/ __/  '_/
>>/___/ .__/\_,_/_/ /_/\_\   version 2.0.0
>>   /_/
>> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
>> 1.8.0_77)
>> Type in expressions to have them evaluated.
>> Type :help for more information.
>>
>> scala> import org.apache.spark.SparkContext
>> import org.apache.spark.SparkContext
>> scala>  val conf = new
>> SparkConf().setMaster("local[2]").setAppName("CountingSheep").
>> *set("spark.driver.allowMultipleContexts", "true")*conf:
>> org.apache.spark.SparkConf = org.apache.spark.SparkConf@bb5f9d
>> scala> val sc = new SparkContext(conf)
>> sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@4888425d
>>
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 13 September 2016 at 18:57, Sean Owen <so...@cloudera.com> wrote:
>>
>>> But you're in the shell there, which already has a SparkContext for you
>>> as sc.
>>>
>>> On Tue, Sep 13, 2016 at 6:49 PM, Kevin Burton <bur...@spinn3r.com>
>>> wrote:
>>>
>>>> I'm rather confused here as to what to do about creating a new
>>>> SparkContext.
>>>>
>>>> Spark 2.0 prevents it... (exception included below)
>>>>
>>>> yet a TON of examples I've seen basically tell you to create a new
>>>> SparkContext as standard practice:
>>>>
>>>>
>>>> http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties
>>>>
>>>> val conf = new SparkConf()
>>>>  .setMaster("local[2]")
>>>>  .setAppName("CountingSheep")val sc = new SparkContext(conf)
>>>>
>>>>
>>>> I'm specifically running into a problem in that ES hadoop won't work
>>>> with its settings and I think its related to this problme.
>>>>
>>>> Do we have to call sc.stop() first and THEN create a new spark context?
>>>>
>>>> That works,, but I can't find any documentation anywhere telling us the
>>>> right course of action.
>>>>
>>>>
>>>>
>>>> scala> val sc = new SparkContext();
>>>> org.apache.spark.SparkException: Only one SparkContext may be running
>>>> in this JVM (see SPARK-2243). To ignore this error, set
>>>> spark.driver.allowMultipleContexts = true. The currently running
>>>> SparkContext was created at:
>>>>
>>>> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
>>>> org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
>>>> (:15)
>>>> (:31)
>>>> (:33)
>>>> .(:37)
>>>> .()
>>>> .$print$lzycompute(:7)
>>>> .$print(

Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread Sean Owen
The key is really to specify the distance metric that defines
"closeness" for you. You have features that aren't on the same scale,
and some that aren't continuous. You might look to clustering for
ideas here, though mostly you just want to normalize the scale of
dimensions to make them comparable.

You can find nearest neighbors by brute force. If speed really matters
you can consider locality sensitive hashing, which isn't that hard to
implement and can give a lot of speed for a small cost in accuracy.

However if your rule is really like "must match column A and B and
then closest value in column C then just ordering everything by A, B,
C lets you pretty much read off the answer from the result set
directly. Everything is closest to one of its two neighbors.

On Tue, Sep 13, 2016 at 6:18 PM, Mobius ReX  wrote:
> Given a table
>
>> $cat data.csv
>>
>> ID,State,City,Price,Number,Flag
>> 1,CA,A,100,1000,0
>> 2,CA,A,96,1010,1
>> 3,CA,A,195,1010,1
>> 4,NY,B,124,2000,0
>> 5,NY,B,128,2001,1
>> 6,NY,C,24,3,0
>> 7,NY,C,27,30100,1
>> 8,NY,C,29,30200,0
>> 9,NY,C,39,33000,1
>
>
> Expected Result:
>
> ID0, ID1
> 1,2
> 4,5
> 6,7
> 8,7
>
> for each ID with Flag=0 above, we want to find another ID from Flag=1, with
> the same "State" and "City", and the nearest Price and Number normalized by
> the corresponding values of that ID with Flag=0.
>
> For example, ID = 1 and ID=2, has the same State and City, but different
> FLAG.
> After normalized the Price and Number (Price divided by 100, Number divided
> by 1000), the distance between ID=1 and ID=2 is defined as :
> abs(100/100 - 96/100) + abs(1000/1000 - 1010/1000) = 0.04 + 0.01 = 0.05
>
>
> What's the best way to find such nearest neighbor? Any valuable tips will be
> greatly appreciated!
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Sean Owen
But you're in the shell there, which already has a SparkContext for you as
sc.

On Tue, Sep 13, 2016 at 6:49 PM, Kevin Burton  wrote:

> I'm rather confused here as to what to do about creating a new
> SparkContext.
>
> Spark 2.0 prevents it... (exception included below)
>
> yet a TON of examples I've seen basically tell you to create a new
> SparkContext as standard practice:
>
> http://spark.apache.org/docs/latest/configuration.html#
> dynamically-loading-spark-properties
>
> val conf = new SparkConf()
>  .setMaster("local[2]")
>  .setAppName("CountingSheep")val sc = new SparkContext(conf)
>
>
> I'm specifically running into a problem in that ES hadoop won't work with
> its settings and I think its related to this problme.
>
> Do we have to call sc.stop() first and THEN create a new spark context?
>
> That works,, but I can't find any documentation anywhere telling us the
> right course of action.
>
>
>
> scala> val sc = new SparkContext();
> org.apache.spark.SparkException: Only one SparkContext may be running in
> this JVM (see SPARK-2243). To ignore this error, set 
> spark.driver.allowMultipleContexts
> = true. The currently running SparkContext was created at:
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.
> scala:823)
> org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
> (:15)
> (:31)
> (:33)
> .(:37)
> .()
> .$print$lzycompute(:7)
> .$print(:6)
> $print()
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:
> 62)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:497)
> scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
> scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$
> loadAndRunReq$1.apply(IMain.scala:638)
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$
> loadAndRunReq$1.apply(IMain.scala:637)
> scala.reflect.internal.util.ScalaClassLoader$class.
> asContext(ScalaClassLoader.scala:31)
> scala.reflect.internal.util.AbstractFileClassLoader.asContext(
> AbstractFileClassLoader.scala:19)
>   at org.apache.spark.SparkContext$$anonfun$assertNoOtherContextIsRunning$
> 2.apply(SparkContext.scala:2221)
>   at org.apache.spark.SparkContext$$anonfun$assertNoOtherContextIsRunning$
> 2.apply(SparkContext.scala:2217)
>   at scala.Option.foreach(Option.scala:257)
>   at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(
> SparkContext.scala:2217)
>   at org.apache.spark.SparkContext$.markPartiallyConstructed(
> SparkContext.scala:2290)
>   at org.apache.spark.SparkContext.(SparkContext.scala:89)
>   at org.apache.spark.SparkContext.(SparkContext.scala:121)
>   ... 48 elided
>
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
>
>


Re: I noticed LinearRegression sometimes produces negative R^2 values

2016-09-07 Thread Sean Owen
Yes, should be.
It's also not necessarily nonnegative if you evaluate R^2 on a
different data set than you fit it to. Is that the case?

On Tue, Sep 6, 2016 at 11:15 PM, Evan Zamir <zamir.e...@gmail.com> wrote:
> I am using the default setting for setting fitIntercept, which *should* be
> TRUE right?
>
> On Tue, Sep 6, 2016 at 1:38 PM Sean Owen <so...@cloudera.com> wrote:
>>
>> Are you not fitting an intercept / regressing through the origin? with
>> that constraint it's no longer true that R^2 is necessarily
>> nonnegative. It basically means that the errors are even bigger than
>> what you'd get by predicting the data's mean value as a constant
>> model.
>>
>> On Tue, Sep 6, 2016 at 8:49 PM, evanzamir <zamir.e...@gmail.com> wrote:
>> > Am I misinterpreting what r2() in the LinearRegression Model summary
>> > means?
>> > By definition, R^2 should never be a negative number!
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/I-noticed-LinearRegression-sometimes-produces-negative-R-2-values-tp27667.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > -
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: I noticed LinearRegression sometimes produces negative R^2 values

2016-09-06 Thread Sean Owen
Are you not fitting an intercept / regressing through the origin? with
that constraint it's no longer true that R^2 is necessarily
nonnegative. It basically means that the errors are even bigger than
what you'd get by predicting the data's mean value as a constant
model.
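
For reference, the quantity in question is

  R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

which goes negative exactly when the residual sum of squares exceeds the
total sum of squares around the mean, i.e. when the fit is worse than the
constant mean-value predictor described above.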

On Tue, Sep 6, 2016 at 8:49 PM, evanzamir  wrote:
> Am I misinterpreting what r2() in the LinearRegression Model summary means?
> By definition, R^2 should never be a negative number!
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/I-noticed-LinearRegression-sometimes-produces-negative-R-2-values-tp27667.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Why there is no top method in dataset api

2016-09-05 Thread Sean Owen
No,
I'm not advising you to use .rdd, just saying it is possible.
Although, given Datasets now, I'd only use RDDs if you had a good reason
to; they are not gone or even deprecated.

You do not need to order the whole data set to get the top element.
That isn't what top does though. You might be interested to look at
the source code. Nor is it what orderBy does if the optimizer is any good.

Computing .rdd doesn't materialize an RDD. It involves some non-zero
overhead in creating a plan, which should be minor compared to execution.
So would any computation of "top N" on a Dataset, so I don't think this is
relevant.

orderBy + take is already the way to accomplish "Dataset.top". It works on
Datasets, and therefore DataFrames too, for the reason you give. I'm not
sure what you're asking there.
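
To make both routes concrete, a rough sketch (Spark 2.x Scala shell, where
the needed implicits are already in scope):

val ds = Seq(5, 3, 9, 1, 7).toDS()

ds.rdd.top(3)                        // Array(9, 7, 5), via the underlying RDD
ds.orderBy($"value".desc).take(3)    // same result; "value" is the column name of a Dataset[Int]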


On Mon, Sep 5, 2016, 13:01 Jakub Dubovsky <spark.dubovsky.ja...@gmail.com>
wrote:

> Thanks Sean,
>
> I was under impression that spark creators are trying to persuade user
> community not to use RDD api directly. Spark summit I attended was full of
> this. So I am a bit surprised that I hear use-rdd-api as an advice from
> you. But if this is a way then I have a second question. For conversion
> from dataset to rdd I would use Dataset.rdd lazy val. Since it is a lazy
> val it suggests there is some computation going on to create rdd as a copy.
> The question is how much computationally expansive is this conversion? If
> there is a significant overhead then it is clear why one would want to have
> top method directly on Dataset class.
>
> Ordering whole dataset only to take first 10 or so top records is not
> really an acceptable option for us. Comparison function can be expansive
> and the size of dataset is (unsurprisingly) big.
>
> To be honest I do not really understand what do you mean by b). Since
> DataFrame is now only an alias for Dataset[Row] what do you mean by
> "DataFrame-like counterpart"?
>
> Thanks
>
> On Thu, Sep 1, 2016 at 2:31 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> You can always call .rdd.top(n) of course. Although it's slightly
>> clunky, you can also .orderBy($"value".desc).take(n). Maybe there's an
>> easier way.
>>
>> I don't think if there's a strong reason other than it wasn't worth it
>> to write this and many other utility wrappers that a) already exist on
>> the underlying RDD API if you want them, and b) have a DataFrame-like
>> counterpart already that doesn't really need wrapping in a different
>> API.
>>
>> On Thu, Sep 1, 2016 at 12:53 PM, Jakub Dubovsky
>> <spark.dubovsky.ja...@gmail.com> wrote:
>> > Hey all,
>> >
>> > in RDD api there is very usefull method called top. It finds top n
>> records
>> > in according to certain ordering without sorting all records. Very
>> usefull!
>> >
>> > There is no top method nor similar functionality in Dataset api. Has
>> anybody
>> > any clue why? Is there any specific reason for this?
>> >
>> > Any thoughts?
>> >
>> > thanks
>> >
>> > Jakub D.
>>
>
>


Re: How to detect when a JavaSparkContext gets stopped

2016-09-05 Thread Sean Owen
You can look into the SparkListener interface to get some of those
messages. Losing the master though is pretty fatal to all apps.
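
A rough sketch of registering one (Scala; addSparkListener is a DeveloperApi,
and with a JavaSparkContext you would reach it via jsc.sc()):

import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd, SparkListenerExecutorRemoved}

sc.addSparkListener(new SparkListener {
  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = {
    // the context is shutting down: flip a flag, stop accepting requests, etc.
  }
  override def onExecutorRemoved(removed: SparkListenerExecutorRemoved): Unit = {
    // executors lost, e.g. when the AWS instances behind them go away
  }
})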

On Mon, Sep 5, 2016 at 7:30 AM, Hough, Stephen C  wrote:
> I have a long running application, configured to be HA, whereby only the
> designated leader will acquire a JavaSparkContext, listen for requests and
> push jobs onto this context.
>
>
>
> The problem I have is, whenever my AWS instances running workers die (either
> a time to live expires or I cancel those instances) it seems that Spark
> blames my driver, I see the following in logs.
>
>
>
> org.apache.spark.SparkException: Exiting due to error from cluster
> scheduler: Master removed our application: FAILED
>
>
>
> However my application doesn’t get a notification so thinks everything is
> okay, until it receives another request and tries to submit to the spark and
> gets a
>
>
>
> java.lang.IllegalStateException: Cannot call methods on a stopped
> SparkContext.
>
>
>
> Is there a way I can observe when the JavaSparkContext I own is stopped?
>
>
>
> Thanks
> Stephen
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: BinaryClassificationMetrics - get raw tp/fp/tn/fn stats per threshold?

2016-09-02 Thread Sean Owen
Given recall by threshold, you can compute true positive count per
threshold by just multiplying through by the count of elements where
label = 1. From that you can get false negatives by subtracting from
that same count.

Given precision by threshold, and the true positive count by threshold,
you can work out the count of elements that were predicted as
positive, by threshold. That should then get you to the false positive
count, and from there you can similarly work out true negatives.

So I think you can reverse engineer those stats from precision and
recall without much trouble. I don't think there's a more direct way,
no, unless you want to clone the code and/or hack into it with
reflection.
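
A rough sketch of that reverse-engineering (Scala, .mllib API), assuming
`metrics` and `predictionAndLabels: RDD[(Double, Double)]` as in your snippet
below; the counts come out as Doubles, so round as needed:

val numPos = predictionAndLabels.filter(_._2 == 1.0).count().toDouble
val numNeg = predictionAndLabels.count() - numPos

val confusionByThreshold =
  metrics.recallByThreshold().join(metrics.precisionByThreshold()).mapValues {
    case (recall, precision) =>
      val tp = recall * numPos                              // true positives
      val fn = numPos - tp                                  // false negatives
      val predictedPos = if (precision > 0) tp / precision else 0.0
      val fp = predictedPos - tp                            // false positives
      val tn = numNeg - fp                                  // true negatives
      (tp, fp, tn, fn)
  }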

On Fri, Sep 2, 2016 at 2:54 PM, Spencer, Alex (Santander)
 wrote:
> Hi,
>
>
>
> BinaryClassificationMetrics expose recall and precision byThreshold. Is
> there a way to true negatives / false negatives etc per threshold?
>
>
>
> I have weighted my genuines and would like the adjusted precision / FPR.
> (Unless there is an option that I’ve missed, although I have been over the
> Class twice now and can’t see any weighting options). I had to build my own,
> which seems a bit like reinventing the wheel (isn’t as safe + fast for a
> start):
>
>
>
> val threshold_stats =
> metrics.thresholds.cartesian(predictionAndLabels).map{case (t, (prob,
> label)) =>
>
>   val selected = (prob >= t)
>
>   val fraud = (label == 1.0)
>
>
>
>   val tp = if (fraud && selected) 1 else 0
>
>   val fp = if (!fraud && selected) 1 else 0
>
>   val tn = if (!fraud && !selected) 1 else 0
>
>   val fn = if (fraud && !selected) 1 else 0
>
>
>
>   (t, (tp, fp, tn, fn))
>
> }.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2, x._3 + y._3, x._4 +
> y._4))
>
>
>
> Kind Regards,
>
> Alex.
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: PySpark: preference for Python 2.7 or Python 3.5?

2016-09-02 Thread Sean Owen
Spark should work fine with Python 3. I'm not a Python person, but all else
equal I'd use 3.5 too. I assume the issue could be libraries you want that
don't support Python 3. I don't think that changes with CDH. It includes a
version of Anaconda from Continuum, but that lays down Python 2.7.11. I
don't believe there's any particular position on 2 vs 3.

On Fri, Sep 2, 2016 at 3:56 AM, Ian Stokes Rees 
wrote:

> I have the option of running PySpark with Python 2.7 or Python 3.5.  I am
> fairly expert with Python and know the Python-side history of the
> differences.  All else being the same, I have a preference for Python 3.5.
> I'm using CDH 5.8 and I'm wondering if that biases whether I should proceed
> with PySpark on top of Python 2.7 or 3.5.  Opinions?  Does Cloudera have an
> official (or unofficial) position on this?
>
> Thanks,
>
> Ian
> ___
> Ian Stokes-Rees
> Computational Scientist
>
>


Re: Difference between Data set and Data Frame in Spark 2

2016-09-01 Thread Sean Owen
On Thu, Sep 1, 2016 at 4:56 PM, Mich Talebzadeh
 wrote:
> Data Frame built on top of RDD to create as tabular format that we all love
> to make the original build easily usable (say SQL like queries, column
> headings etc). The drawback is it restricts you with what you can do with
> Data Frame (now that you have dome RDD.toDF)

DataFrame is a Dataset[Row], literally, rather than based on an RDD.

> DataSet  is the new RDD with improvements on RDD. As I understand from
> Sean's explanation they add some optimisation on top the common RDD.

At the moment I don't think there's any particular reason to use RDDs
except to interoperate with code that uses RDDs -- which is entirely
valid. I believe new code would generally touch only Dataset and
DataFrame otherwise. So I don't think there are really 3 elemental
concepts in play as of Spark 2.x.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Difference between Data set and Data Frame in Spark 2

2016-09-01 Thread Sean Owen
Here's my paraphrase:

Datasets are really the new RDDs. They have a similar nature
(container of strongly-typed objects) but bring some optimizations via
Encoders for common types.

DataFrames are different from RDDs and Datasets and do not replace and
are not replaced by them. They're fundamentally for tabular data, not
arbitrary objects, and thus supports SQL-like operations that only
make sense on tabular  data.
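
A rough sketch of the contrast (Spark 2.x Scala; assumes a SparkSession named
`spark`, so its implicits can be imported):

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

case class Person(name: String, age: Int)

// Dataset: a container of typed objects, RDD-like, but backed by an Encoder
val ds: Dataset[Person] = Seq(Person("alice", 30), Person("bob", 25)).toDS()
ds.filter(_.age > 28)                            // lambda over Person objects

// DataFrame = Dataset[Row]: tabular, so column-based, SQL-like operations
val df: DataFrame = ds.toDF()
df.filter($"age" > 28).groupBy($"name").count()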

On Thu, Sep 1, 2016 at 3:17 PM, Ashok Kumar
 wrote:
> Hi,
>
> What are practical differences between the new Data set in Spark 2 and the
> existing DataFrame.
>
> Has Dataset replaced Data Frame and what advantages it has if I use Data
> Frame instead of Data Frame.
>
> Thanks
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-01 Thread Sean Owen
Yeah there's a method to predict one Vector in the .mllib API but not
the newer one. You could possibly hack your way into calling it
anyway, or just clone the logic.

On Thu, Sep 1, 2016 at 2:37 PM, Nick Pentreath  wrote:
> Right now you are correct that Spark ML APIs do not support predicting on a
> single instance (whether Vector for the models or a Row for a pipeline).
>
> See https://issues.apache.org/jira/browse/SPARK-10413 and
> https://issues.apache.org/jira/browse/SPARK-16431 (duplicate) for some
> discussion.
>
> There may be movement in the short term to support the single Vector case.
> But anything for pipelines is not immediately on the horizon I'd say.
>
> N

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-01 Thread Sean Owen
How the model is built isn't that related to how it scores things.
Here we're just talking about scoring. NaiveBayesModel can score
Vector which is not a distributed entity. That's what you want to use.
You do not want to use a whole distributed operation to score one
record. This isn't related to .ml vs .mllib APIs.
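
A rough sketch of local scoring with the .mllib model (Scala; assumes a
`training: RDD[LabeledPoint]` already exists):

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors

val model = NaiveBayes.train(training)       // distributed training, done once
val features = Vectors.dense(0.0, 1.0, 3.2)  // one local record
val label = model.predict(features)          // plain local math, no Spark job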

On Thu, Sep 1, 2016 at 2:01 PM, Aseem Bansal <asmbans...@gmail.com> wrote:
> I understand your point.
>
> Is there something like a bridge? Is it possible to convert the model
> trained using Dataset (i.e. the distributed one) to the one which uses
> vectors? In Spark 1.6 the mllib packages had everything as per vectors and
> that should be faster as per my understanding. But in many spark blogs we
> saw that spark is moving towards the ml package and mllib package will be
> phased out. So how can someone train using huge data and then use it on a
> row by row basis?
>
> Thanks for your inputs.
>
> On Thu, Sep 1, 2016 at 6:15 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> If you're trying to score a single example by way of an RDD or
>> Dataset, then no it will never be that fast. It's a whole distributed
>> operation, and while you might manage low latency for one job at a
>> time, consider what will happen when hundreds of them are running at
>> once. It's just huge overkill for scoring a single example (but,
>> pretty fine for high-er latency, high throughput batch operations)
>>
>> However if you're scoring a Vector locally I can't imagine it's that
>> slow. It does some linear algebra but it's not that complicated. Even
>> something unoptimized should be fast.
>>
>> On Thu, Sep 1, 2016 at 1:37 PM, Aseem Bansal <asmbans...@gmail.com> wrote:
>> > Hi
>> >
>> > Currently trying to use NaiveBayes to make predictions. But facing
>> > issues
>> > that doing the predictions takes order of few seconds. I tried with
>> > other
>> > model examples shipped with Spark but they also ran in minimum of 500 ms
>> > when I used Scala API. With
>> >
>> > Has anyone used spark ML to do predictions for a single row under 20 ms?
>> >
>> > I am not doing premature optimization. The use case is that we are doing
>> > real time predictions and we need results 20ms. Maximum 30ms. This is a
>> > hard
>> > limit for our use case.
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-01 Thread Sean Owen
If you're trying to score a single example by way of an RDD or
Dataset, then no it will never be that fast. It's a whole distributed
operation, and while you might manage low latency for one job at a
time, consider what will happen when hundreds of them are running at
once. It's just huge overkill for scoring a single example (but
pretty fine for higher-latency, high-throughput batch operations).

However if you're scoring a Vector locally I can't imagine it's that
slow. It does some linear algebra but it's not that complicated. Even
something unoptimized should be fast.

On Thu, Sep 1, 2016 at 1:37 PM, Aseem Bansal  wrote:
> Hi
>
> Currently trying to use NaiveBayes to make predictions. But facing issues
> that doing the predictions takes order of few seconds. I tried with other
> model examples shipped with Spark but they also ran in minimum of 500 ms
> when I used Scala API. With
>
> Has anyone used spark ML to do predictions for a single row under 20 ms?
>
> I am not doing premature optimization. The use case is that we are doing
> real time predictions and we need results 20ms. Maximum 30ms. This is a hard
> limit for our use case.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Why there is no top method in dataset api

2016-09-01 Thread Sean Owen
You can always call .rdd.top(n) of course. Although it's slightly
clunky, you can also .orderBy($"value".desc).take(n). Maybe there's an
easier way.

I don't think there's a strong reason other than that it wasn't worth it
to write this and many other utility wrappers that a) already exist on
the underlying RDD API if you want them, and b) have a DataFrame-like
counterpart already that doesn't really need wrapping in a different
API.

On Thu, Sep 1, 2016 at 12:53 PM, Jakub Dubovsky
 wrote:
> Hey all,
>
> in RDD api there is very usefull method called top. It finds top n records
> in according to certain ordering without sorting all records. Very usefull!
>
> There is no top method nor similar functionality in Dataset api. Has anybody
> any clue why? Is there any specific reason for this?
>
> Any thoughts?
>
> thanks
>
> Jakub D.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 2.0.0 - Java vs Scala performance difference

2016-09-01 Thread Sean Owen
I can't think of a situation where it would be materially different.
Both are using the JVM-based APIs directly. Here and there there's a
tiny bit of overhead in using the Java APIs because something is
translated from a Java-style object to a Scala-style object, but this
is generally trivial.

On Thu, Sep 1, 2016 at 10:06 AM, Aseem Bansal  wrote:
> Hi
>
> Would there be any significant performance difference when using Java vs.
> Scala API?

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: difference between package and jar Option in Spark

2016-09-01 Thread Sean Owen
--jars includes a local JAR file in the application's classpath.
--packages references the Maven coordinates of a dependency, retrieves the
corresponding JAR files (including transitive dependencies), and includes
them in the app classpath.
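
For example (a sketch; the class, jar name and coordinates below are just
placeholders):

spark-submit --class com.example.MyApp --jars /path/to/extra-lib.jar myapp.jar

spark-submit --class com.example.MyApp --packages com.example:some-artifact_2.11:1.0.0 myapp.jar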

On Thu, Sep 1, 2016 at 10:24 AM, Divya Gehlot  wrote:
> Hi,
>
> Would like to know the difference between the --package and --jars option in
> Spark .
>
>
>
> Thanks,
> Divya

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Model abstract class in spark ml

2016-08-31 Thread Sean Owen
Weird, I recompiled Spark with a similar change to Model and it seemed
to work but maybe I missed a step in there.

On Wed, Aug 31, 2016 at 6:33 AM, Mohit Jaggi  wrote:
> I think I figured it out. There is indeed "something deeper in Scala” :-)
>
> abstract class A {
>   def a: this.type
> }
>
> class AA(i: Int) extends A {
>   def a = this
> }
>
> the above works ok. But if you return anything other than “this”, you will
> get a compile error.
>
> abstract class A {
>   def a: this.type
> }
>
> class AA(i: Int) extends A {
>   def a = new AA(1)
> }
>
> Error:(33, 11) type mismatch;
>  found   : com.dataorchard.datagears.AA
>  required: AA.this.type
>   def a = new AA(1)
>   ^
>
> So you have to do:
>
> abstract class A[T <: A[T]]  {
>   def a: T
> }
>
> class AA(i: Int) extends A[AA] {
>   def a = new AA(1)
> }
>
>
>
> Mohit Jaggi
> Founder,
> Data Orchard LLC
> www.dataorchardllc.com
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Model abstract class in spark ml

2016-08-30 Thread Sean Owen
I think it's imitating, for example, how Enum is declared in Java:

abstract class Enum<E extends Enum<E>>

this is done so that Enum can refer to the actual type of the derived
enum class when declaring things like public final int compareTo(E o)
to implement Comparable<E>. The type is redundant in a sense, because
you effectively have MyEnum extending Enum<MyEnum>.

Java allows this self-referential definition. However Scala has
"this.type" for this purpose and (unless I'm about to learn something
deeper about Scala) it would have been the better way to express this
so that Model methods can for example state that copy() returns a
Model of the same concrete type.

I don't know if it can be changed now without breaking compatibility
but you're welcome to give it a shot with MiMa to see. It does
compile, using this.type.


On Tue, Aug 30, 2016 at 9:47 PM, Mohit Jaggi  wrote:
> Folks,
> I am having a bit of trouble understanding the following:
>
> abstract class Model[M <: Model[M]]
>
> Why is M <: Model[M]?
>
> Cheers,
> Mohit.
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Coding in the Spark ml "ecosystem" why is everything private?!

2016-08-29 Thread Sean Owen
If something isn't public, then it could change across even
maintenance releases. Although you can indeed still access it in some
cases by writing code in the same package, you're taking some risk
that it will stop working across releases.

If it's not public, the message is that you should build it yourself,
yes. For example, OpenHashSet will never be meant to be reused. You
can use your own from a library.

If there's a clear opportunity to expose something cleanly you can
bring it up for discussion. But it's never just a matter of making
something public. Making it public means committing others' time to
supporting it as-is for years. It would have to be worth it.

On Mon, Aug 29, 2016 at 5:46 PM, Thunder Stumpges
 wrote:
> Hi all,
>
> I'm not sure if this belongs here in users or over in dev as I guess it's
> somewhere in between. We have been starting to implement some machine
> learning pipelines, and it seemed from the documentation that Spark had a
> fairly well thought-out platform (see:
> http://spark.apache.org/docs/1.6.1/ml-guide.html )
>
> I liked the design of Transformers, Models, Estimators, Pipelines, etc.
> However as soon as we began attempting to code our first ones, we began
> running into one class or method after another that has been marked
> private... Some examples are:
>
> - SchemaUtils - (for validating schemas passed in and out, and adding output
> columns to DataFrames)
> - Loader / Saveable (traits / helpers for saving and loading models)
> - Several classes under 'collection' namespace like OpenHashSet /
> OpenHashMap
> - All of the underlying linear algebra Breeze details
> - Other classes specific to certain models. We are writing an alternative
> LDA Optimizer / Trainer and everything under LDAUtils is private.
>
> I'd like to ask what the expected approach is here. I see a few options,
> none of which seem appropriate:
>
> 1. Implement everything in the org.apache.spark.* namespaces to match
> package privates
> - will this even work in our own modules ?
> - we would be open to contributing some of our code back but not sure
> the project wants it
> 2. Implement our own versions of all of these things.
>- lots of extra work for us, leads to unseen gotchas in implementations
> and other unforseen issues
> 3. Copy classes into our namespace for use
>- duplicates code, leads to code diversion as the main code is kept up to
> date.
>
> Thanks in advance for any recommendations on this frustrating issue.
> Thunder
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How can we connect RDD from previous job to next job

2016-08-29 Thread Sean Owen
If you mean to persist data in an RDD, then you should do just that --
persist the RDD to durable storage so it can be read later by any
other app. Checkpointing is not a way to store RDDs, but a specific
way to recover the same application in some cases. Parquet has been
supported for a long while, yes. It's the most common binary format.
You could also literally store the serialized form of your objects.
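
A rough sketch of both options (Scala, 1.6-style APIs; the paths, names and
the MyRecord type are placeholders):

// Job 1: persist the computed result durably
resultDF.write.parquet("hdfs:///data/job1-output")
// Job 2: read it back as the starting point and merge with the new batch
val previous = sqlContext.read.parquet("hdfs:///data/job1-output")

// or, for arbitrary objects rather than tabular data:
resultRDD.saveAsObjectFile("hdfs:///data/job1-output-rdd")
val previousRdd = sc.objectFile[MyRecord]("hdfs:///data/job1-output-rdd")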

On Mon, Aug 29, 2016 at 9:27 AM, Sachin Mittal <sjmit...@gmail.com> wrote:
> I understood the approach.
> Does spark 1.6 support Parquet format, I mean saving and loading from
> Parquet file.
>
> Also if I use checkpoint, what I understand is that RDD location on
> filesystem is not removed when job is over. So I can read that RDD in next
> job.
> Is that one of the usecase of checkpoint. Basically does my current problem
> can be solved using checkpoint.
>
> Also which option would be better, store the output of RDD to a persistent
> storage, or store the new RDD of that ouput itself using checkpoint.
>
> Thanks
> Sachin
>
>
>
>
> On Mon, Aug 29, 2016 at 1:39 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> You just save the data in the RDD in whatever form you want to
>> whatever persistent storage you want, and then re-read it from another
>> job. This could be Parquet format on HDFS for example. Parquet is just
>> a common file format. There is no need to keep the job running just to
>> keep an RDD alive.
>>
>> On Mon, Aug 29, 2016 at 5:30 AM, Sachin Mittal <sjmit...@gmail.com> wrote:
>> > Hi,
>> > I would need some thoughts or inputs or any starting point to achieve
>> > following scenario.
>> > I submit a job using spark-submit with a certain set of parameters.
>> >
>> > It reads data from a source, does some processing on RDDs and generates
>> > some
>> > output and completes.
>> >
>> > Then I submit same job again with next set of parameters.
>> > It should also read data from a source do same processing and at the
>> > same
>> > time read data from the result generated by previous job and merge the
>> > two
>> > and again store the results.
>> >
>> > This process goes on and on.
>> >
>> > So I need to store RDD or output of RDD into some storage of previous
>> > job to
>> > make it available to next job.
>> >
>> > What are my options.
>> > 1. Use checkpoint
>> > Can I use checkpoint on the final stage of RDD and then load the same
>> > RDD
>> > again by specifying checkpoint path in next job. Is checkpoint right for
>> > this kind of situation.
>> >
>> > 2. Save output of previous job into some json file and then create a
>> > data
>> > frame of that in next job.
>> > Have I got this right, is this option better than option 1.
>> >
>> > 3. I have heard a lot about paquet files. However I don't know how it
>> > integrates with spark.
>> > Can I use that here as intermediate storage.
>> > Is this available in spark 1.6?
>> >
>> > Any other thoughts or idea.
>> >
>> > Thanks
>> > Sachin
>> >
>> >
>> >
>> >
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How can we connect RDD from previous job to next job

2016-08-29 Thread Sean Owen
You just save the data in the RDD in whatever form you want to
whatever persistent storage you want, and then re-read it from another
job. This could be Parquet format on HDFS for example. Parquet is just
a common file format. There is no need to keep the job running just to
keep an RDD alive.

On Mon, Aug 29, 2016 at 5:30 AM, Sachin Mittal  wrote:
> Hi,
> I would need some thoughts or inputs or any starting point to achieve
> following scenario.
> I submit a job using spark-submit with a certain set of parameters.
>
> It reads data from a source, does some processing on RDDs and generates some
> output and completes.
>
> Then I submit same job again with next set of parameters.
> It should also read data from a source do same processing and at the same
> time read data from the result generated by previous job and merge the two
> and again store the results.
>
> This process goes on and on.
>
> So I need to store RDD or output of RDD into some storage of previous job to
> make it available to next job.
>
> What are my options.
> 1. Use checkpoint
> Can I use checkpoint on the final stage of RDD and then load the same RDD
> again by specifying checkpoint path in next job. Is checkpoint right for
> this kind of situation.
>
> 2. Save output of previous job into some json file and then create a data
> frame of that in next job.
> Have I got this right, is this option better than option 1.
>
> 3. I have heard a lot about paquet files. However I don't know how it
> integrates with spark.
> Can I use that here as intermediate storage.
> Is this available in spark 1.6?
>
> Any other thoughts or idea.
>
> Thanks
> Sachin
>
>
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark StringType could hold how many characters ?

2016-08-28 Thread Sean Owen
No, it is just being truncated for display as the ... implies. Pass
truncate=false to the show command.
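
For example, a sketch assuming `df` is the DataFrame read back from the
Parquet file (in PySpark the spelling is df.show(truncate=False)):

df.show(20, false)   // the second argument disables the 20-character truncation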

On Sun, Aug 28, 2016, 15:24 Kevin Tran  wrote:

> Hi,
> I wrote to parquet file as following:
>
> ++
> |word|
> ++
> |THIS IS MY CHARACTERS ...|
> |// ANOTHER LINE OF CHAC...|
> ++
>
> These lines are not full text and it is being trimmed down.
> Does anyone know how many chacters StringType could handle ?
>
> In the Spark code:
> org.apache.spark.sql.types.StringType
> /**
>* The default size of a value of the StringType is 4096 bytes.
>*/
>   override def defaultSize: Int = 4096
>
>
> Thanks,
> Kevin.
>


Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Sean Owen
Without a distributed storage system, your application can only create data
on the driver and send it out to the workers, and collect data back from
the workers. You can't read or write data in a distributed way. There are
use cases for this, but pretty limited (unless you're running on 1 machine).

I can't really imagine a serious use of (distributed) Spark without
(distributed) storage; put another way, I don't think many apps exist that
don't read/write data.

The premise here is not just replication, but partitioning data across
compute resources. With a distributed file system, your big input exists
across a bunch of machines and you can send the work to the pieces of data.

On Thu, Aug 25, 2016 at 7:57 PM, kant kodali  wrote:

> @Mich I understand why I would need Zookeeper. It is there for fault
> tolerance given that spark is a master-slave architecture and when a mater
> goes down zookeeper will run a leader election algorithm to elect a new
> leader however DevOps hate Zookeeper they would be much happier to go with
> etcd & consul and looks like if we mesos scheduler we should be able to
> drop Zookeeper.
>
> HDFS I am still trying to understand why I would need for spark. I
> understand the purpose of distributed file systems in general but I don't
> understand in the context of spark since many people say you can run a
> spark distributed cluster in a stand alone mode but I am not sure what are
> its pros/cons if we do it that way. In a hadoop world I understand that one
> of the reasons HDFS is there is for replication other words if we write
> some data to a HDFS it will store that block across different nodes such
> that if one of nodes goes down it can still retrieve that block from other
> nodes. In the context of spark I am not really sure because 1) I am new 2)
> Spark paper says it doesn't replicate data instead it stores the
> lineage(all the transformations) such that it can reconstruct it.
>
>
>
>
>
>
> On Thu, Aug 25, 2016 9:18 AM, Mich Talebzadeh mich.talebza...@gmail.com
> wrote:
>
>> You can use Spark on Oracle as a query tool.
>>
>> It all depends on the mode of the operation.
>>
>> If you running Spark with yarn-client/cluster then you will need yarn. It
>> comes as part of Hadoop core (HDFS, Map-reduce and Yarn).
>>
>> I have not gone and installed Yarn without installing Hadoop.
>>
>> What is the overriding reason to have the Spark on its own?
>>
>>  You can use Spark in Local or Standalone mode if you do not want Hadoop
>> core.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 24 August 2016 at 21:54, kant kodali  wrote:
>>
>> What do I loose if I run spark without using HDFS or Zookeper ? which of
>> them is almost a must in practice?
>>
>>
>>


Re: Sqoop vs spark jdbc

2016-08-25 Thread Sean Owen
Sqoop is probably the more mature tool for the job. It also just does
one thing. The argument for doing it in Spark would be wanting to
integrate it with a larger workflow. I imagine Sqoop would be more
efficient and flexible for just the task of ingest, including
continuously pulling deltas which I am not sure Spark really does for
you.

MapReduce won't matter here. The bottleneck is reading from the RDBMS
in general.
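
For what it's worth, a rough sketch of a partitioned JDBC read on the Spark
side (Spark 2.x Scala; the URL, table, user and column names are placeholders):

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/SVC")
  .option("dbtable", "MYSCHEMA.MY_TABLE")
  .option("user", "scott")
  .option("password", "tiger")
  .option("partitionColumn", "ID")     // numeric column to split on
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")        // 8 parallel reads against the RDBMS
  .load()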

On Wed, Aug 24, 2016 at 11:07 PM, Mich Talebzadeh
 wrote:
> Personally I prefer Spark JDBC.
>
> Both Sqoop and Spark rely on the same drivers.
>
> I think Spark is faster and if you have many nodes you can partition your
> incoming data and take advantage of Spark DAG + in memory offering.
>
> By default Sqoop will use Map-reduce which is pretty slow.
>
> Remember for Spark you will need to have sufficient memory
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>
> On 24 August 2016 at 22:39, Venkata Penikalapati
>  wrote:
>>
>> Team,
>> Please help me in choosing sqoop or spark jdbc to fetch data from rdbms.
>> Sqoop has lot of optimizations to fetch data does spark jdbc also has those
>> ?
>>
>> I'm performing few analytics using spark data for which data is residing
>> in rdbms.
>>
>> Please guide me with this.
>>
>>
>> Thanks
>> Venkata Karthik P
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Are RDD's ever persisted to disk?

2016-08-23 Thread Sean Owen
We're probably mixing up some semantics here. An RDD is indeed,
really, just some bookkeeping that records how a certain result is
computed. It is not the data itself.

However we often talk about "persisting an RDD" which means
"persisting the result of computing the RDD" in which case that
persisted representation can be used instead of recomputing it.

The result of computing an RDD is really some objects in memory. It's
possible to persist the RDD in memory by just storing these objects in
memory as cached partitions. This involves no serialization.

Data can be persisted to disk but this involves serializing objects to
bytes (not byte code). It's also possible to store a serialized
representation in memory because it may be more compact.

This is not the same as saving/writing an RDD to persistent storage as
text or JSON or whatever.
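
A rough sketch of the distinction (Scala; `rdd` is any computed RDD, the path
is a placeholder):

import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_ONLY)   // cached partitions as live objects, no serialization
// alternatives (one level per RDD): MEMORY_ONLY_SER keeps serialized bytes in
// memory (more compact); DISK_ONLY writes serialized bytes to local disk

rdd.saveAsTextFile("hdfs:///tmp/output")  // by contrast: actually writing the data out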

On Tue, Aug 23, 2016 at 9:28 PM, kant kodali  wrote:
> @srkanth are you sure? the whole point of RDD's is to store transformations
> but not the data as the spark paper points out but I do lack the practical
> experience for me to confirm. when I looked at the spark source
> code(specifically the checkpoint code) a while ago it was clearly storing
> some JVM byte code to disk which I thought were the transformations.
>
>
>
> On Tue, Aug 23, 2016 1:11 PM, srikanth.je...@gmail.com wrote:
>>
>> RDD contains data but not JVM byte code i.e. data which is read from
>> source and transformations have been applied. This is ideal case to persist
>> RDDs.. As Nirav mentioned this data will be serialized before persisting to
>> disk..
>>
>>
>>
>> Thanks,
>> Sreekanth Jella
>>
>>
>>
>> From: kant kodali
>> Sent: Tuesday, August 23, 2016 3:59 PM
>> To: Nirav
>> Cc: RK Aduri; srikanth.je...@gmail.com; user@spark.apache.org
>> Subject: Re: Are RDD's ever persisted to disk?
>>
>>
>>
>> Storing RDD to disk is nothing but storing JVM byte code to disk (in case
>> of Java or Scala). am I correct?
>>
>>
>>
>>
>>
>> On Tue, Aug 23, 2016 12:55 PM, Nirav nira...@gmail.com wrote:
>>
>> You can store either in serialized form(butter array) or just save it in a
>> string format like tsv or csv. There are different RDD save apis for that.
>>
>> Sent from my iPhone
>>
>>
>> On Aug 23, 2016, at 12:26 PM, kant kodali  wrote:
>>
>> ok now that I understand RDD can be stored to the disk. My last question
>> on this topic would be this.
>>
>>
>>
>> Storing RDD to disk is nothing but storing JVM byte code to disk (in case
>> of Java or Scala). am I correct?
>>
>>
>>
>>
>>
>> On Tue, Aug 23, 2016 12:19 PM, RK Aduri rkad...@collectivei.com wrote:
>>
>> On an other note, if you have a streaming app, you checkpoint the RDDs so
>> that they can be accessed in case of a failure. And yes, RDDs are persisted
>> to DISK. You can access spark’s UI and see it listed under Storage tab.
>>
>>
>>
>> If RDDs are persisted in memory, you avoid any disk I/Os so that any
>> lookups will be cheap. RDDs are reconstructed based on a graph (DAG -
>> available in Spark UI )
>>
>>
>>
>> On Aug 23, 2016, at 12:10 PM, 
>>  wrote:
>>
>>
>>
>> RAM or Virtual memory is finite, so data size needs to be considered
>> before persist. Please see below documentation when to choose the
>> persistency level.
>>
>>
>>
>>
>> http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
>>
>>
>>
>> Thanks,
>> Sreekanth Jella
>>
>>
>>
>> From: kant kodali
>> Sent: Tuesday, August 23, 2016 2:42 PM
>> To: srikanth.je...@gmail.com
>> Cc: user@spark.apache.org
>> Subject: Re: Are RDD's ever persisted to disk?
>>
>>
>>
>> so when do we ever need to persist RDD on disk? given that we don't need
>> to worry about RAM(memory) as virtual memory will just push pages to the
>> disk when memory becomes scarce.
>>
>>
>>
>>
>>
>> On Tue, Aug 23, 2016 11:23 AM, srikanth.je...@gmail.com wrote:
>>
>> Hi Kant Kodali,
>>
>>
>>
>> Based on the input parameter to persist() method either it will be cached
>> on memory or persisted to disk. In case of failures Spark will reconstruct
>> the RDD on a different executor based on the DAG. That is how failures are
>> handled. Spark Core does not replicate the RDDs as they can be reconstructed
>> from the source (let’s say HDFS, Hive or S3 etc.) but not from memory (which
>> is lost already).
>>
>>
>>
>> Thanks,
>> Sreekanth Jella
>>
>>
>>
>> From: kant kodali
>> Sent: Tuesday, August 23, 2016 2:12 PM
>> To: user@spark.apache.org
>> Subject: Are RDD's ever persisted to disk?
>>
>>
>>
>> I am new to spark and I keep hearing that RDD's can be persisted to memory
>> or disk after each checkpoint. I wonder why RDD's are persisted in memory?
>> In case of node failure how would you access memory to reconstruct the RDD?
>> persisting to disk make sense because its like persisting to a Network file
>> system (in case of HDFS) where a each block will have multiple copies across
>> nodes so if a node goes down 

Re: Dataframe corrupted when sqlContext.read.json on a Gzipped file that contains more than one file

2016-08-21 Thread Sean Owen
You are attempting to read a tar file. That won't work. A compressed JSON
file would.
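
For example (a sketch, reusing your file names): compress each JSON file on
its own (gzip a.json, producing a.json.gz) rather than tar-ing them together,
and then read it directly:

val datasets = sqlContext.read.json("a.json.gz")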

On Sun, Aug 21, 2016, 12:52 Chua Jie Sheng  wrote:

> Hi Spark user list!
>
> I have been encountering corrupted records when reading Gzipped files that
> contains more than one file.
>
> Example:
> I have two .json file, [a.json, b.json]
> Each have multiple records (one line, one record).
>
> I tar both of them together on
>
> Mac OS X, 10.11.6
> bsdtar 2.8.3 - libarchive 2.8.3
>
> i.e. tar -czf a.tgz *.json
>
>
> When I attempt to read them (via Python):
>
> filename = "a.tgz"
> sqlContext = SQLContext(sc)
> datasets = sqlContext.read.json(filename)
>
> datasets.show(1, truncate=False)
>
>
> My first record will always be corrupted, showing up in _corrupt_record.
>
> Does anyone have any idea if it is feature or a defect?
>
> Best Regards
> Jie Sheng
>
> Important: This email is confidential and may be privileged. If you are
> not the intended recipient, please delete it and notify us immediately; you
> should not copy or use it for any purpose, nor disclose its contents to any
> other person. Thank you.
>


Re: Spark 2.0 regression when querying very wide data frames

2016-08-20 Thread Sean Owen
Yes, have a look through JIRA in cases like this.
https://issues.apache.org/jira/browse/SPARK-16664

On Sat, Aug 20, 2016 at 1:57 AM, mhornbech  wrote:
> I did some extra digging. Running the query "select column1 from myTable" I
> can reproduce the problem on a frame with a single row - it occurs exactly
> when the frame has more than 200 columns, which smells a bit like a
> hardcoded limit.
>
> Interestingly the problem disappears when replacing the query with "select
> column1 from myTable limit N" where N is arbitrary. However it appears again
> when running "select * from myTable limit N" with sufficiently many columns
> (haven't determined the exact threshold here).
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-2-0-regression-when-querying-very-wide-data-frames-tp27567p27568.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: 2.0.1/2.1.x release dates

2016-08-18 Thread Sean Owen
Historically, minor releases happen every ~4 months, and maintenance
releases are a bit ad hoc but come about a month after the minor
release. It's up to the release manager to decide to do them but maybe
realistic to expect 2.0.1 in early September.

On Thu, Aug 18, 2016 at 10:35 AM, Adrian Bridgett  wrote:
> Just wondering if there were any rumoured release dates for either of the
> above.  I'm seeing some odd hangs with 2.0.0 and mesos (and I know that the
> mesos integration has had a bit of updating in 2.1.x).   Looking at JIRA,
> there's no suggested release date and issues seem to be added to a release
> version once resolved so the usual trick of looking at the
> resolved/unresolved ratio isn't helping :-)  The wiki only mentions 2.0.0 so
> no joy there either.
>
> Still doing testing but then I don't want to test with 2.1.x if it's going
> to be under heavy development for a while longer.
>
> Thanks for any info,
>
> Adrian
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: DataFrame use case

2016-08-16 Thread Sean Owen
I'd say that Datasets, not DataFrames, are the natural evolution of
RDDs. DataFrames are for inherently tabular data, and most naturally
manipulated by SQL-like operations. Datasets operate on programming
language objects like RDDs.

So, RDDs to DataFrames isn't quite apples-to-apples to begin with.
It's just never true that "X is always faster than Y" in a case like
this. Indeed your case doesn't sound like anything where a tabular
representation would be beneficial. There's overhead to treating it
like that. You're doing almost nothing to the data itself except
counting it, and RDDs have the lowest overhead of the three concepts
because they treat their contents as opaque objects anyway.

The benefit comes when you do things like SQL-like operations on
tabular data in the DataFrame API instead of RDD API. That's where
more optimization can kick in. Dataset brings some of the same
possible optimizations to an RDD-like API because it has more
knowledge of the type and nature of the entire data set.

If you're really only manipulating byte arrays, I don't know if
DataFrame adds anything. I know Dataset has some specialization for
byte[], so I'd expect you could see some storage benefits over RDDs,
maybe.

On Tue, Aug 16, 2016 at 6:32 PM, jtgenesis  wrote:
> Hey guys, I've been digging around trying to figure out if I should
> transition from RDDs to DataFrames. I'm currently using RDDs to represent
> tiles of binary imagery data and I'm wondering if representing the data as a
> DataFrame is a better solution.
>
> To get my feet wet, I did a little comparison on a Word Count application,
> on a 1GB file of random text, using an RDD and DataFrame. And I got the
> following results:
>
> RDD Count total: 137733312 Time Elapsed: 44.5675378 s
> DataFrame Count total: 137733312 Time Elapsed: 69.201253448 s
>
> I figured the DataFrame would outperform the RDD, since I've seen many
> sources that state superior speeds with DataFrames. These results could just
> be an implementation issue, unstructured data, or a result of the data
> source. I'm not really sure.
>
> This leads me to take a step back and figure out what applications are
> better suited with DataFrames than RDDs? In my case, while the original
> image file is unstructured. The data is loaded in a pairRDD, where the key
> contains multiple attributes that pertain to the value. The value is a chunk
> of the image represented as an array of bytes. Since, my data will be in a
> structured format, I don't see why I can't benefit from DataFrames. However,
> should I be concerned of any performance issues that pertain to
> processing/moving of byte array (each chunk is uniform size in the KB-MB
> range). I'll potentially be scanning the entire image, select specific image
> tiles and perform some work on them.
>
> If DataFrames are well suited for my use case, how does the data source
> affect my performance? I could always just load data into an RDD and convert
> to DataFrame, or I could convert the image into a parquet file and create
> DataFrames directly. Is one way recommended over the other?
>
> These are a lot of questions, and I'm still trying to ingest and make sense
> of everything. Any feedback would be greatly appreciated.
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-use-case-tp27543.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Number of tasks on executors become negative after executor failures

2016-08-15 Thread Sean Owen
-dev (this is appropriate for user@)

Probably https://issues.apache.org/jira/browse/SPARK-10141 or
https://issues.apache.org/jira/browse/SPARK-11334 but those aren't
resolved. Feel free to jump in.


On Mon, Aug 15, 2016 at 8:13 PM, Rachana Srivastava <
rachana.srivast...@markmonitor.com> wrote:

> *Summary:*
>
> I am running Spark 1.5 on CDH5.5.1.  Under extreme load intermittently I
> am getting this connection failure exception and later negative executor in
> the Spark UI.
>
>
>
> *Exception:*
>
> TRACE: org.apache.hadoop.hbase.ipc.AbstractRpcClient - Call: Multi,
> callTime: 76ms
>
> INFO : org.apache.spark.network.client.TransportClientFactory - Found
> inactive connection to /xxx.xxx.xxx., creating a new one.
>
> ERROR: org.apache.spark.network.shuffle.RetryingBlockFetcher - Exception
> while beginning fetch of 1 outstanding blocks (after 1 retries)
>
> java.io.IOException: Failed to connect to /xxx.xxx.xxx.
>
> at org.apache.spark.network.client.TransportClientFactory.
> createClient(TransportClientFactory.java:193)
>
> at org.apache.spark.network.client.TransportClientFactory.
> createClient(TransportClientFactory.java:156)
>
> at org.apache.spark.network.netty.
> NettyBlockTransferService$$anon$1.createAndStart(
> NettyBlockTransferService.scala:88)
>
> at org.apache.spark.network.shuffle.RetryingBlockFetcher.
> fetchAllOutstanding(RetryingBlockFetcher.java:140)
>
> at org.apache.spark.network.shuffle.RetryingBlockFetcher.
> access$200(RetryingBlockFetcher.java:43)
>
> at org.apache.spark.network.shuffle.RetryingBlockFetcher$
> 1.run(RetryingBlockFetcher.java:170)
>
> at java.util.concurrent.Executors$RunnableAdapter.
> call(Executors.java:471)
>
> at java.util.concurrent.FutureTask.run(FutureTask.
> java:262)
>
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1145)
>
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
> Caused by: java.net.ConnectException: Connection refused:
> /xxx.xxx.xxx.
>
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native
> Method)
>
> at sun.nio.ch.SocketChannelImpl.finishConnect(
> SocketChannelImpl.java:739)
>
> at io.netty.channel.socket.nio.NioSocketChannel.
> doFinishConnect(NioSocketChannel.java:224)
>
> at io.netty.channel.nio.AbstractNioChannel$
> AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
>
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(
> NioEventLoop.java:528)
>
> at io.netty.channel.nio.NioEventLoop.
> processSelectedKeysOptimized(NioEventLoop.java:468)
>
> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(
> NioEventLoop.java:382)
>
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.
> java:354)
>
> at io.netty.util.concurrent.SingleThreadEventExecutor$2.
> run(SingleThreadEventExecutor.java:111)
>
> ... 1 more
>
>
>
>
>
> *Related Defects*:
>
> https://issues.apache.org/jira/browse/SPARK-2319
>
> https://issues.apache.org/jira/browse/SPARK-9591
>
>
>
>
>
>


Re: spark ml : auc on extreme distributed data

2016-08-15 Thread Sean Owen
Class imbalance can be an issue for algorithms, but decision forests
should in general cope reasonably well with imbalanced classes. By
default, positive and negative classes are treated 'equally' however,
and that may not reflect reality in some cases. Upsampling the
under-represented case is a crude but effective way to counter this.

Of course the model depends on the data distribution, but it also
depends on the data itself. And the ROC curve depends on the model
and data. There is no inherent relationship between the class balance
and the ROC curve though.

AUC for a random-guessing classifier should be ~0.5. 0.8 is generally
good. I could believe that this doesn't change much just because you
changed parameters or representation.

This isn't really a Spark question per se so you might get some other
answers on the Data Science or Stats StackExchange.

On Mon, Aug 15, 2016 at 5:11 AM, Zhiliang Zhu
 wrote:
> Hi All,
>
> Here I have lot of data with around 1,000,000 rows, 97% of them are negative
> class and 3% of them are positive class .
> I applied Random Forest algorithm to build the model and predict the testing
> data.
>
> For the data preparation,
> i. firstly randomly split all the data as training data and testing data by
> 0.7 : 0.3
> ii. let the testing data unchanged, its negative and positive class ratio
> would still be 97:3
> iii. try to make the training data negative and positive class ratio as
> 50:50, by way of sample algorithm in the different classes
> iv. get RF model by training data and predict testing data
>
> by modifying algorithm parameters and feature work (PCA etc ), it seems that
> the auc on the testing data is always above 0.8, or much more higher ...
>
> Then I lose into some confusion... It seems that the model or auc depends a
> lot on the original data distribution...
> In effect, I would like to know, for this data distribution, how its auc
> would be for random guess?
> What the auc would be for any kind of data distribution?
>
> Thanks in advance~~

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Standardization with Sparse Vectors

2016-08-11 Thread Sean Owen
I should be more clear, since the outcome of the discussion above was
not that obvious actually.

- I agree a change should be made to StandardScaler, and not VectorAssembler
- However I do think withMean should still be false by default and be
explicitly enabled
- The 'offset' idea is orthogonal, and as Nick says may be problematic
anyway a step or two down the line. I'm proposing just converting to
dense vectors if asked to center (which is why it shouldn't be the
default)

Indeed to answer your question, that's how I had resolved this in user
code earlier. It's the same thing you're suggesting here, to make a
UDF that converts the vectors to dense vectors manually.

I updated the JIRA accordingly, to suggest converting to DenseVector
in StandardScaler if withMean is set explicitly to true. I think we
should consider something like the 'offset' idea separately if at all.

On Thu, Aug 11, 2016 at 11:02 AM, Sean Owen <so...@cloudera.com> wrote:
> No, that doesn't describe the change being discussed, since you've
> copied the discussion about adding an 'offset'. That's orthogonal.
> You're also suggesting making withMean=True the default, which we
> don't want. The point is that if this is *explicitly* requested, the
> scaler shouldn't refuse to subtract the mean from a sparse vector, and
> fail.
>
> On Wed, Aug 10, 2016 at 8:47 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
>> Sean,
>>
>> I have created a jira; I hope you don't mind that I borrowed your
>> explanation of "offset". https://issues.apache.org/jira/browse/SPARK-17001
>>
>> So what did you do to standardize your data, if you didn't use
>> standardScaler? Did you write a udf to subtract mean and divide by standard
>> deviation?
>>
>> Although I know this is not the best approach for something I plan to put in
>> production, I have been trying to write a udf to turn the sparse vector into
>> a dense one and apply the udf in withColumn(). withColumn() complains that
>> the data is a tuple. I think the issue might be the datatype parameter. The
>> function returns a vector of doubles but there is no type that would be
>> adequate for this.
>>
>> sparseToDense=udf(lambda data: float(DenseVector([data.toArray()])),
>> DoubleType())
>> denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
>> sparseToDense("features"))
>>
>> However the function works outside the udf, but I am unable to add an
>> arbitrary column to the data frame I started out working with. Thoughts?
>>
>> denseFeatures=TrainingRdf.select("features").map(lambda data:
>> DenseVector([data.features.toArray()]))
>> denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
>> denseFeatures)
>>
>> Thanks,
>> Tobi
>>
>>
>> On Wed, Aug 10, 2016 at 1:01 PM, Nick Pentreath <nick.pentre...@gmail.com>
>> wrote:
>>>
>>> Ah right, got it. As you say for storage it helps significantly, but for
>>> operations I suspect it puts one back in a "dense-like" position. Still, for
>>> online / mini-batch algorithms it may still be feasible I guess.
>>> On Wed, 10 Aug 2016 at 19:50, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>> All elements, I think. Imagine a sparse vector 1:3 3:7 which conceptually
>>>> represents 0 3 0 7. Imagine it also has an offset stored which applies to
>>>> all elements. If it is -2 then it now represents -2 1 -2 5, but this
>>>> requires just one extra value to store. It only helps with storage of a
>>>> shifted sparse vector; iterating still typically requires iterating all
>>>> elements.
>>>>
>>>> Probably, where this would help, the caller can track this offset and
>>>> even more efficiently apply this knowledge. I remember digging into this in
>>>> how sparse covariance matrices are computed. It almost but not quite 
>>>> enabled
>>>> an optimization.
>>>>
>>>>
>>>> On Wed, Aug 10, 2016, 18:10 Nick Pentreath <nick.pentre...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Sean by 'offset' do you mean basically subtracting the mean but only
>>>>> from the non-zero elements in each row?
>>>>> On Wed, 10 Aug 2016 at 19:02, Sean Owen <so...@cloudera.com> wrote:
>>>>>>
>>>>>> Yeah I had thought the same, that perhaps it's fine to let the
>>>>>> StandardScaler proceed, if it's explicitly asked to center, rather
>>>>>> than refuse to. It's not really much more rope to let a user 

Re: Standardization with Sparse Vectors

2016-08-11 Thread Sean Owen
No, that doesn't describe the change being discussed, since you've
copied the discussion about adding an 'offset'. That's orthogonal.
You're also suggesting making withMean=True the default, which we
don't want. The point is that if this is *explicitly* requested, the
scaler shouldn't refuse to subtract the mean from a sparse vector, and
fail.

On Wed, Aug 10, 2016 at 8:47 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
> Sean,
>
> I have created a jira; I hope you don't mind that I borrowed your
> explanation of "offset". https://issues.apache.org/jira/browse/SPARK-17001
>
> So what did you do to standardize your data, if you didn't use
> standardScaler? Did you write a udf to subtract mean and divide by standard
> deviation?
>
> Although I know this is not the best approach for something I plan to put in
> production, I have been trying to write a udf to turn the sparse vector into
> a dense one and apply the udf in withColumn(). withColumn() complains that
> the data is a tuple. I think the issue might be the datatype parameter. The
> function returns a vector of doubles but there is no type that would be
> adequate for this.
>
> sparseToDense=udf(lambda data: float(DenseVector([data.toArray()])),
> DoubleType())
> denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
> sparseToDense("features"))
>
> However the function works outside the udf, but I am unable to add an
> arbitrary column to the data frame I started out working with. Thoughts?
>
> denseFeatures=TrainingRdf.select("features").map(lambda data:
> DenseVector([data.features.toArray()]))
> denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
> denseFeatures)
>
> Thanks,
> Tobi
>
>
> On Wed, Aug 10, 2016 at 1:01 PM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>>
>> Ah right, got it. As you say for storage it helps significantly, but for
>> operations I suspect it puts one back in a "dense-like" position. Still, for
>> online / mini-batch algorithms it may still be feasible I guess.
>> On Wed, 10 Aug 2016 at 19:50, Sean Owen <so...@cloudera.com> wrote:
>>>
>>> All elements, I think. Imagine a sparse vector 1:3 3:7 which conceptually
>>> represents 0 3 0 7. Imagine it also has an offset stored which applies to
>>> all elements. If it is -2 then it now represents -2 1 -2 5, but this
>>> requires just one extra value to store. It only helps with storage of a
>>> shifted sparse vector; iterating still typically requires iterating all
>>> elements.
>>>
>>> Probably, where this would help, the caller can track this offset and
>>> even more efficiently apply this knowledge. I remember digging into this in
>>> how sparse covariance matrices are computed. It almost but not quite enabled
>>> an optimization.
>>>
>>>
>>> On Wed, Aug 10, 2016, 18:10 Nick Pentreath <nick.pentre...@gmail.com>
>>> wrote:
>>>>
>>>> Sean by 'offset' do you mean basically subtracting the mean but only
>>>> from the non-zero elements in each row?
>>>> On Wed, 10 Aug 2016 at 19:02, Sean Owen <so...@cloudera.com> wrote:
>>>>>
>>>>> Yeah I had thought the same, that perhaps it's fine to let the
>>>>> StandardScaler proceed, if it's explicitly asked to center, rather
>>>>> than refuse to. It's not really much more rope to let a user hang
>>>>> herself with, and, blocks legitimate usages (we ran into this last
>>>>> week and couldn't use StandardScaler as a result).
>>>>>
>>>>> I'm personally supportive of the change and don't see a JIRA. I think
>>>>> you could at least make one.
>>>>>
>>>>> On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede <ani.to...@gmail.com>
>>>>> wrote:
>>>>> > Thanks Sean, I agree with 100% that the math is math and dense vs
>>>>> > sparse is
>>>>> > just a matter of representation. I was trying to convince a co-worker
>>>>> > of
>>>>> > this to no avail. Sending this email was mainly a sanity check.
>>>>> >
>>>>> > I think having an offset would be a great idea, although I am not
>>>>> > sure how
>>>>> > to implement this. However, if anything should be done to rectify
>>>>> > this
>>>>> > issue, it should be done in the standardScaler, not vectorAssembler.
>>>>> > There
>>>>> > should not be any forcing of vect

Re: Standardization with Sparse Vectors

2016-08-10 Thread Sean Owen
All elements, I think. Imagine a sparse vector 1:3 3:7 which conceptually
represents 0 3 0 7. Imagine it also has an offset stored which applies to
all elements. If it is -2 then it now represents -2 1 -2 5, but this
requires just one extra value to store. It only helps with storage of a
shifted sparse vector; iterating still typically requires iterating all
elements.

Probably, where this would help, the caller can track this offset and even
more efficiently apply this knowledge. I remember digging into this in how
sparse covariance matrices are computed. It almost but not quite enabled an
optimization.
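
Purely to illustrate the idea (this is not a Spark API), a tiny sketch of a
sparse vector carrying an offset, using the example above:

    import numpy as np

    class OffsetSparseVector(object):
        """A sparse vector plus one scalar offset applied to all elements."""
        def __init__(self, size, indices, values, offset=0.0):
            self.size, self.indices = size, indices
            self.values, self.offset = values, offset

        def to_array(self):
            # Storage stays sparse, but materializing is still O(size).
            arr = np.full(self.size, self.offset)
            arr[self.indices] += self.values
            return arr

    v = OffsetSparseVector(4, [1, 3], [3.0, 7.0], offset=-2.0)
    print(v.to_array())  # [-2.  1. -2.  5.]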

On Wed, Aug 10, 2016, 18:10 Nick Pentreath <nick.pentre...@gmail.com> wrote:

> Sean by 'offset' do you mean basically subtracting the mean but only from
> the non-zero elements in each row?
> On Wed, 10 Aug 2016 at 19:02, Sean Owen <so...@cloudera.com> wrote:
>
>> Yeah I had thought the same, that perhaps it's fine to let the
>> StandardScaler proceed, if it's explicitly asked to center, rather
>> than refuse to. It's not really much more rope to let a user hang
>> herself with, and, blocks legitimate usages (we ran into this last
>> week and couldn't use StandardScaler as a result).
>>
>> I'm personally supportive of the change and don't see a JIRA. I think
>> you could at least make one.
>>
>> On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
>> > Thanks Sean, I agree with 100% that the math is math and dense vs
>> sparse is
>> > just a matter of representation. I was trying to convince a co-worker of
>> > this to no avail. Sending this email was mainly a sanity check.
>> >
>> > I think having an offset would be a great idea, although I am not sure
>> how
>> > to implement this. However, if anything should be done to rectify this
>> > issue, it should be done in the standardScaler, not vectorAssembler.
>> There
>> > should not be any forcing of vectorAssembler to produce only dense
>> vectors
>> > so as to avoid performance problems with data that does not fit in
>> memory.
>> > Furthermore, not every machine learning algo requires standardization.
>> > Instead, standardScaler should have withmean=True as default and should
>> > apply an offset if the vector is sparse, whereas there would be normal
>> > subtraction if the vector is dense. This way the default behavior of
>> > standardScaler will always be what is generally understood to be
>> > standardization, as opposed to people thinking they are standardizing
>> when
>> > they actually are not.
>> >
>> > Can anyone confirm whether there is a jira already?
>> >
>> > On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen <so...@cloudera.com> wrote:
>> >>
>> >> Dense vs sparse is just a question of representation, so doesn't make
>> >> an operation on a vector more or less important as a result. You've
>> >> identified the reason that subtracting the mean can be undesirable: a
>> >> notionally billion-element sparse vector becomes too big to fit in
>> >> memory at once.
>> >>
>> >> I know this came up as a problem recently (I think there's a JIRA?)
>> >> because VectorAssembler will *sometimes* output a small dense vector
>> >> and sometimes output a small sparse vector based on how many zeroes
>> >> there are. But that's bad because then the StandardScaler can't
>> >> process the output at all. You can work on this if you're interested;
>> >> I think the proposal was to be able to force a dense representation
>> >> only in VectorAssembler. I don't know if that's the nature of the
>> >> problem you're hitting.
>> >>
>> >> It can be meaningful to only scale the dimension without centering it,
>> >> but it's not the same thing, no. The math is the math.
>> >>
>> >> This has come up a few times -- it's necessary to center a sparse
>> >> vector but prohibitive to do so. One idea I'd toyed with in the past
>> >> was to let a sparse vector have an 'offset' value applied to all
>> >> elements. That would let you shift all values while preserving a
>> >> sparse representation. I'm not sure if it's worth implementing but
>> >> would help this case.
>> >>
>> >>
>> >>
>> >>
>> >> On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede <ani.to...@gmail.com>
>> wrote:
>> >> > Hi everyone,
>> >> >
>> >> > I am doing some standardization 

Re: Standardization with Sparse Vectors

2016-08-10 Thread Sean Owen
Yeah I had thought the same, that perhaps it's fine to let the
StandardScaler proceed, if it's explicitly asked to center, rather
than refuse to. It's not really much more rope to let a user hang
herself with, and, blocks legitimate usages (we ran into this last
week and couldn't use StandardScaler as a result).

I'm personally supportive of the change and don't see a JIRA. I think
you could at least make one.

On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
> Thanks Sean, I agree 100% that the math is the math and dense vs sparse is
> just a matter of representation. I was trying to convince a co-worker of
> this to no avail. Sending this email was mainly a sanity check.
>
> I think having an offset would be a great idea, although I am not sure how
> to implement this. However, if anything should be done to rectify this
> issue, it should be done in the standardScaler, not vectorAssembler. There
> should not be any forcing of vectorAssembler to produce only dense vectors
> so as to avoid performance problems with data that does not fit in memory.
> Furthermore, not every machine learning algo requires standardization.
> Instead, standardScaler should have withmean=True as default and should
> apply an offset if the vector is sparse, whereas there would be normal
> subtraction if the vector is dense. This way the default behavior of
> standardScaler will always be what is generally understood to be
> standardization, as opposed to people thinking they are standardizing when
> they actually are not.
>
> Can anyone confirm whether there is a jira already?
>
> On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> Dense vs sparse is just a question of representation, so doesn't make
>> an operation on a vector more or less important as a result. You've
>> identified the reason that subtracting the mean can be undesirable: a
>> notionally billion-element sparse vector becomes too big to fit in
>> memory at once.
>>
>> I know this came up as a problem recently (I think there's a JIRA?)
>> because VectorAssembler will *sometimes* output a small dense vector
>> and sometimes output a small sparse vector based on how many zeroes
>> there are. But that's bad because then the StandardScaler can't
>> process the output at all. You can work on this if you're interested;
>> I think the proposal was to be able to force a dense representation
>> only in VectorAssembler. I don't know if that's the nature of the
>> problem you're hitting.
>>
>> It can be meaningful to only scale the dimension without centering it,
>> but it's not the same thing, no. The math is the math.
>>
>> This has come up a few times -- it's necessary to center a sparse
>> vector but prohibitive to do so. One idea I'd toyed with in the past
>> was to let a sparse vector have an 'offset' value applied to all
>> elements. That would let you shift all values while preserving a
>> sparse representation. I'm not sure if it's worth implementing but
>> would help this case.
>>
>>
>>
>>
>> On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
>> > Hi everyone,
>> >
>> > I am doing some standardization using standardScaler on data from
>> > VectorAssembler which is represented as sparse vectors. I plan to fit a
>> > regularized model.  However, standardScaler does not allow the mean to
>> > be
>> > subtracted from sparse vectors. It will only divide by the standard
>> > deviation, which I understand is to keep the vector sparse. Thus I am
>> > trying
>> > to convert my sparse vectors into dense vectors, but this may not be
>> > worthwhile.
>> >
>> > So my questions are:
>> > Is subtracting the mean during standardization only important when
>> > working
>> > with dense vectors? Does it not matter for sparse vectors? Is just
>> > dividing
>> > by the standard deviation with sparse vectors equivalent to also
>> > dividing by
>> > standard deviation w and subtracting mean with dense vectors?
>> >
>> > Thank you,
>> > Tobi
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Standardization with Sparse Vectors

2016-08-10 Thread Sean Owen
Dense vs sparse is just a question of representation, so doesn't make
an operation on a vector more or less important as a result. You've
identified the reason that subtracting the mean can be undesirable: a
notionally billion-element sparse vector becomes too big to fit in
memory at once.

I know this came up as a problem recently (I think there's a JIRA?)
because VectorAssembler will *sometimes* output a small dense vector
and sometimes output a small sparse vector based on how many zeroes
there are. But that's bad because then the StandardScaler can't
process the output at all. You can work on this if you're interested;
I think the proposal was to be able to force a dense representation
only in VectorAssembler. I don't know if that's the nature of the
problem you're hitting.

It can be meaningful to only scale the dimension without centering it,
but it's not the same thing, no. The math is the math.

This has come up a few times -- it's necessary to center a sparse
vector but prohibitive to do so. One idea I'd toyed with in the past
was to let a sparse vector have an 'offset' value applied to all
elements. That would let you shift all values while preserving a
sparse representation. I'm not sure if it's worth implementing but
would help this case.




On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede  wrote:
> Hi everyone,
>
> I am doing some standardization using standardScaler on data from
> VectorAssembler which is represented as sparse vectors. I plan to fit a
> regularized model.  However, standardScaler does not allow the mean to be
> subtracted from sparse vectors. It will only divide by the standard
> deviation, which I understand is to keep the vector sparse. Thus I am trying
> to convert my sparse vectors into dense vectors, but this may not be
> worthwhile.
>
> So my questions are:
> Is subtracting the mean during standardization only important when working
> with dense vectors? Does it not matter for sparse vectors? Is just dividing
> by the standard deviation with sparse vectors equivalent to also dividing by
> standard deviation w and subtracting mean with dense vectors?
>
> Thank you,
> Tobi

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Machine learning question (suing spark)- removing redundant factors while doing clustering

2016-08-10 Thread Sean Owen
Scaling can mean scaling factors up or down so that they're all on a
comparable scale. It certainly changes the sum of squared errors, but,
you can't compare this metric across scaled and unscaled data, exactly
because one is on a totally different scale and will have quite
different absolute values. If that's the motivation here, then, no
it's misleading.

You probably do want to scale factors because the underlying distance
metric (Euclidean) will treat all dimensions equally. If they're on
very different scales, dimensions that happen to have larger units
will dominate.
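
A rough sketch of scaling before clustering with the Spark 2.0 ml Pipeline
API (the DataFrame `df`, its assembled "features" column of 112 factors,
and k=10 are all hypothetical):

    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import StandardScaler

    scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                            withStd=True, withMean=False)
    kmeans = KMeans(featuresCol="scaledFeatures", k=10, seed=1)
    model = Pipeline(stages=[scaler, kmeans]).fit(df)

    # This cost is in *scaled* units, so it cannot be compared with the cost
    # of a model fit on the unscaled features.
    wssse = model.stages[-1].computeCost(model.transform(df))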

On Wed, Aug 10, 2016 at 12:46 PM, Rohit Chaddha
<rohitchaddha1...@gmail.com> wrote:
> Hi Sean,
>
> So basically I am trying to cluster a number of elements (it's a domain
> object called PItem) based on the quality factors of these items.
> These elements have 112 quality factors each.
>
> Now the issue is that when I am scaling the factors using StandardScaler I
> get a Sum of Squared Errors = 13300
> When I don't use scaling the Sum of Squared Errors = 5
>
> I was always of the opinion that different factors being on different scales
> should always be normalized, but I am confused based on the results above,
> and I am wondering what factors should be removed to get a meaningful result
> (maybe with 5% less accuracy).
>
> Will appreciate any help here.
>
> -Rohit
>
> On Tue, Aug 9, 2016 at 12:55 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> Fewer features doesn't necessarily mean better predictions, because indeed
>> you are subtracting data. It might, because when done well you subtract more
>> noise than signal. It is usually done to make data sets smaller or more
>> tractable or to improve explainability.
>>
>> But you have an unsupervised clustering problem where talking about
>> feature importance doesn't make as much sense. Important to what? There is no
>> target variable.
>>
>> PCA will not 'improve' clustering per se but can make it faster.
>> You may want to specify what you are actually trying to optimize.
>>
>>
>> On Tue, Aug 9, 2016, 03:23 Rohit Chaddha <rohitchaddha1...@gmail.com>
>> wrote:
>>>
>>> I would rather have less features to make better inferences on the data
>>> based on the smaller number of factors,
>>> Any suggestions Sean ?
>>>
>>> On Mon, Aug 8, 2016 at 11:37 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>> Yes, that's exactly what PCA is for as Sivakumaran noted. Do you
>>>> really want to select features or just obtain a lower-dimensional
>>>> representation of them, with less redundancy?
>>>>
>>>> On Mon, Aug 8, 2016 at 4:10 PM, Tony Lane <tonylane@gmail.com>
>>>> wrote:
>>>> > There must be an algorithmic way to figure out which of these factors
>>>> > contribute the least and remove them in the analysis.
>>>> > I am hoping someone can throw some insight on this.
>>>> >
>>>> > On Mon, Aug 8, 2016 at 7:41 PM, Sivakumaran S <siva.kuma...@me.com>
>>>> > wrote:
>>>> >>
>>>> >> Not an expert here, but the first step would be devote some time and
>>>> >> identify which of these 112 factors are actually causative. Some
>>>> >> domain
>>>> >> knowledge of the data may be required. Then, you can start of with
>>>> >> PCA.
>>>> >>
>>>> >> HTH,
>>>> >>
>>>> >> Regards,
>>>> >>
>>>> >> Sivakumaran S
>>>> >>
>>>> >> On 08-Aug-2016, at 3:01 PM, Tony Lane <tonylane@gmail.com> wrote:
>>>> >>
>>>> >> Great question Rohit.  I am in my early days of ML as well and it
>>>> >> would be
>>>> >> great if we get some idea on this from other experts on this group.
>>>> >>
>>>> >> I know we can reduce dimensions by using PCA, but i think that does
>>>> >> not
>>>> >> allow us to understand which factors from the original are we using
>>>> >> in the
>>>> >> end.
>>>> >>
>>>> >> - Tony L.
>>>> >>
>>>> >> On Mon, Aug 8, 2016 at 5:12 PM, Rohit Chaddha
>>>> >> <rohitchaddha1...@gmail.com>
>>>> >> wrote:
>>>> >>>
>>>> >>>
>>>> >>> I have a data-set where each data-point has 112 factors.
>>>> >>>
>>>> >>> I want to remove the factors which are not relevant, and say reduce
>>>> >>> to 20
>>>> >>> factors out of these 112 and then do clustering of data-points using
>>>> >>> these
>>>> >>> 20 factors.
>>>> >>>
>>>> >>> How do I do these and how do I figure out which of the 20 factors
>>>> >>> are
>>>> >>> useful for analysis.
>>>> >>>
>>>> >>> I see SVD and PCA implementations, but I am not sure if these give
>>>> >>> which
>>>> >>> elements are removed and which are remaining.
>>>> >>>
>>>> >>> Can someone please help me understand what to do here
>>>> >>>
>>>> >>> thanks,
>>>> >>> -Rohit
>>>> >>>
>>>> >>
>>>> >>
>>>> >
>>>
>>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Sean Owen
Nightlies are built and made available in the ASF snapshot repo, from
master. This is noted at the bottom of the downloads page, and at
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-NightlyBuilds
. This hasn't changed in as long as I can recall.

Nightlies are not blessed, and are not for consumption other than by
developers. That is, you shouldn't bundle them in a release, and shouldn't
release a product based on a "2.0.1 snapshot", for example, because no
such ASF release exists. This info isn't meant to be secret, but is
not made obvious to casual end users for this reason. Yes it's for
developers who want to test other products in advance.

So-called preview releases are really just normal releases and are
made available in the usual way. They just have a different name. I
don't know if another one of those will happen; maybe for 3.0.

The published master snapshot would give you 2.1.0-SNAPSHOT at the
moment. Other branches don't have nightlies, but are likely to be of
less interest.

You can always "mvn -DskipTests install" from a checkout of any branch
to make the branch's SNAPSHOT available in your local Maven repo, or
even publish it to your private repo.

On Tue, Aug 9, 2016 at 7:32 PM, Chris Fregly  wrote:
> this is a valid question.  there are many people building products and
> tooling on top of spark and would like access to the latest snapshots and
> such.  today's ink is yesterday's news to these people - including myself.
>
> what is the best way to get snapshot releases including nightly and
> specially-blessed "preview" releases so that we, too, can say "try the
> latest release in our product"?
>
> there was a lot of chatter during the 2.0.0/2.0.1 release that i largely
> ignored because of conflicting/confusing/changing responses.  and i'd rather
> not dig through jenkins builds to figure this out as i'll likely get it
> wrong.
>
> please provide the relevant snapshot/preview/nightly/whatever repos (or
> equivalent) that we need to include in our builds to have access to the
> absolute latest build assets for every major and minor release.
>
> thanks!
>
> -chris
>
>
> On Tue, Aug 9, 2016 at 10:00 AM, Mich Talebzadeh 
> wrote:
>>
>> LOL
>>
>> Ink has not dried on Spark 2 yet so to speak :)
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed. The
>> author will in no case be liable for any monetary damages arising from such
>> loss, damage or destruction.
>>
>>
>>
>>
>> On 9 August 2016 at 17:56, Mark Hamstra  wrote:
>>>
>>> What are you expecting to find?  There currently are no releases beyond
>>> Spark 2.0.0.
>>>
>>> On Tue, Aug 9, 2016 at 9:55 AM, Jestin Ma 
>>> wrote:

 If we want to use versions of Spark beyond the official 2.0.0 release,
 specifically on Maven + Java, what steps should we take to upgrade? I can't
 find the newer versions on Maven central.

 Thank you!
 Jestin
>>>
>>>
>>
>
>
>
> --
> Chris Fregly
> Research Scientist @ PipelineIO
> San Francisco, CA
> pipeline.io
> advancedspark.com
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Machine learning question (suing spark)- removing redundant factors while doing clustering

2016-08-09 Thread Sean Owen
Fewer features doesn't necessarily mean better predictions, because indeed
you are subtracting data. It might, because when done well you subtract
more noise than signal. It is usually done to make data sets smaller or
more tractable or to improve explainability.

But you have an unsupervised clustering problem where talking about feature
importance doesn't make as much sense. Important to what? There is no target
variable.

PCA will not 'improve' clustering per se but can make it faster.
You may want to specify what you are actually trying to optimize.
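
If a lower-dimensional representation is still what is wanted, a minimal
PySpark sketch (again assuming a hypothetical DataFrame `df` with the 112
factors assembled into a "features" vector column):

    from pyspark.ml.feature import PCA

    pca = PCA(k=20, inputCol="features", outputCol="pcaFeatures")
    model = pca.fit(df)

    # Each component mixes all 112 original factors, so this is a
    # re-representation, not a selection of 20 of the original columns.
    print(model.explainedVariance)  # available on PCAModel in Spark 2.0+
    reduced = model.transform(df).select("pcaFeatures")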

On Tue, Aug 9, 2016, 03:23 Rohit Chaddha <rohitchaddha1...@gmail.com> wrote:

> I would rather have less features to make better inferences on the data
> based on the smaller number of factors,
> Any suggestions Sean ?
>
> On Mon, Aug 8, 2016 at 11:37 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> Yes, that's exactly what PCA is for as Sivakumaran noted. Do you
>> really want to select features or just obtain a lower-dimensional
>> representation of them, with less redundancy?
>>
>> On Mon, Aug 8, 2016 at 4:10 PM, Tony Lane <tonylane@gmail.com> wrote:
>> > There must be an algorithmic way to figure out which of these factors
>> > contribute the least and remove them in the analysis.
>> > I am hoping someone can throw some insight on this.
>> >
>> > On Mon, Aug 8, 2016 at 7:41 PM, Sivakumaran S <siva.kuma...@me.com>
>> wrote:
>> >>
>> >> Not an expert here, but the first step would be devote some time and
>> >> identify which of these 112 factors are actually causative. Some domain
>> >> knowledge of the data may be required. Then, you can start of with PCA.
>> >>
>> >> HTH,
>> >>
>> >> Regards,
>> >>
>> >> Sivakumaran S
>> >>
>> >> On 08-Aug-2016, at 3:01 PM, Tony Lane <tonylane@gmail.com> wrote:
>> >>
>> >> Great question Rohit.  I am in my early days of ML as well and it
>> would be
>> >> great if we get some idea on this from other experts on this group.
>> >>
>> >> I know we can reduce dimensions by using PCA, but i think that does not
>> >> allow us to understand which factors from the original are we using in
>> the
>> >> end.
>> >>
>> >> - Tony L.
>> >>
>> >> On Mon, Aug 8, 2016 at 5:12 PM, Rohit Chaddha <
>> rohitchaddha1...@gmail.com>
>> >> wrote:
>> >>>
>> >>>
>> >>> I have a data-set where each data-point has 112 factors.
>> >>>
>> >>> I want to remove the factors which are not relevant, and say reduce
>> to 20
>> >>> factors out of these 112 and then do clustering of data-points using
>> these
>> >>> 20 factors.
>> >>>
>> >>> How do I do these and how do I figure out which of the 20 factors are
>> >>> useful for analysis.
>> >>>
>> >>> I see SVD and PCA implementations, but I am not sure if these give
>> which
>> >>> elements are removed and which are remaining.
>> >>>
>> >>> Can someone please help me understand what to do here
>> >>>
>> >>> thanks,
>> >>> -Rohit
>> >>>
>> >>
>> >>
>> >
>>
>
>


Re: FW: Have I done everything correctly when subscribing to Spark User List

2016-08-08 Thread Sean Owen
I also don't know what's going on with the "This post has NOT been
accepted by the mailing list yet" message, because actually the
messages always do post. In fact this has been sent to the list 4
times:

https://www.mail-archive.com/search?l=user%40spark.apache.org=dueckm=0=0

On Mon, Aug 8, 2016 at 3:03 PM, Chris Mattmann  wrote:
>
>
>
>
>
> On 8/8/16, 2:03 AM, "matthias.du...@fiduciagad.de" 
>  wrote:
>
>>Hello,
>>
>>I write to you because I am not really sure whether I did everything right 
>>when registering and subscribing to the spark user list.
>>
>>I posted the appended question to Spark User list after subscribing and 
>>receiving the "WELCOME to user@spark.apache.org" mail from 
>>"user-h...@spark.apache.org".
>> But this post is still in state "This post has NOT been accepted by the 
>> mailing list yet.".
>>
>>Is this because I forgot something to do or did something wrong with my user 
>>account (dueckm)? Or is it because no member of the Spark User List reacted 
>>to that post yet?
>>
>>Thanks a lot for your help.
>>
>>Matthias
>>
>>Fiducia & GAD IT AG | www.fiduciagad.de
>>AG Frankfurt a. M. HRB 102381 | Sitz der Gesellschaft: Hahnstr. 48, 60528 
>>Frankfurt a. M. | USt-IdNr. DE 143582320
>>Vorstand: Klaus-Peter Bruns (Vorsitzender), Claus-Dieter Toben (stv. 
>>Vorsitzender),
>>
>>Jens-Olaf Bartels, Martin Beyer, Jörg Dreinhöfer, Wolfgang Eckert, Carsten 
>>Pfläging, Jörg Staff
>>Vorsitzender des Aufsichtsrats: Jürgen Brinkmann
>>
>>- Weitergeleitet von Matthias Dück/M/FAG/FIDUCIA/DE am 08.08.2016 10:57 
>>-
>>
>>Von: dueckm 
>>An: user@spark.apache.org
>>Datum: 04.08.2016 13:27
>>Betreff: Are join/groupBy operations with wide Java Beans using Dataset API 
>>much slower than using RDD API?
>>
>>
>>
>>
>>
>>Hello,
>>
>>I built a prototype that uses join and groupBy operations via Spark RDD API.
>>Recently I migrated it to the Dataset API. Now it runs much slower than with
>>the original RDD implementation.
>>Did I do something wrong here? Or is this a price I have to pay for the more
>>convenient API?
>>Is there a known solution to deal with this effect (eg configuration via
>>"spark.sql.shuffle.partitions" - but now could I determine the correct
>>value)?
>>In my prototype I use Java Beans with a lot of attributes. Does this slow
>>down Spark-operations with Datasets?
>>
>>Here I have an simple example, that shows the difference:
>>JoinGroupByTest.zip
>>
>>- I build 2 RDDs and join and group them. Afterwards I count and display the
>>joined RDDs.  (Method de.testrddds.JoinGroupByTest.joinAndGroupViaRDD() )
>>- When I do the same actions with Datasets it takes approximately 40 times
>>as long (Method de.testrddds.JoinGroupByTest.joinAndGroupViaDatasets()).
>>
>>Thank you very much for your help.
>>Matthias
>>
>>PS1: excuse me for sending this post more than once, but I am new to this
>>mailing list and probably did something wrong when registering/subscribing,
>>so my previous postings have not been accepted ...
>>
>>PS2: See the appended screenshots taken from Spark UI (jobs 0/1 belong to
>>RDD implementation, jobs 2/3 to Dataset):
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>--
>>View this message in context: 
>>http://apache-spark-user-list.1001560.n3.nabble.com/Are-join-groupBy-operations-with-wide-Java-Beans-using-Dataset-API-much-slower-than-using-RDD-API-tp27473.html
>>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>-
>>To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Source format for Apache Spark logo

2016-08-08 Thread Sean Owen
In case the attachments don't come through, BTW those are indeed
downloadable from the directory http://spark.apache.org/images/

On Mon, Aug 8, 2016 at 6:09 PM, Sivakumaran S  wrote:
> Found these from the spark.apache.org website.
>
> HTH,
>
> Sivakumaran S
>
>
>
>
>
> On 08-Aug-2016, at 5:24 PM, michael.ar...@gdata-adan.de wrote:
>
> Hi,
>
> for a presentation I’d apreciate a vector version of the Apache Spark logo,
> unfortunately I cannot find it. Is the Logo available in a vector format
> somewhere?
>
>
> 
> Virus checked by G Data MailSecurity
> Version: AVA 25.7800 dated 08.08.2016
> Virus news: www.antiviruslab.com
>
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 2.0 - make-distribution fails while regular build succeeded

2016-08-04 Thread Sean Owen
That message is a warning, not an error. It is just because you're
cross-compiling with Java 8. If something failed, it was elsewhere.

On Thu, Aug 4, 2016, 07:09 Richard Siebeling  wrote:

> Hi,
>
> spark 2.0 with mapr hadoop libraries was succesfully build using the
> following command:
> ./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0-mapr-1602
> -DskipTests clean package
>
> However when I then try to build a runnable distribution using the
> following command
> ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.7
> -Dhadoop.version=2.7.0-mapr-1602
>
> It fails with the error "bootstrap class path not set in conjunction with
> -source 1.7"
> Could you please help? I do not know what this error means,
>
> thanks in advance,
> Richard
>
>
>


Re: 2.0.0 packages for twitter streaming, flume and other connectors

2016-08-03 Thread Sean Owen
You're looking for http://bahir.apache.org/

On Wed, Aug 3, 2016 at 8:40 PM, Kiran Chitturi
 wrote:
> Hi,
>
> When Spark 2.0.0 is released, the 'spark-streaming-twitter' package and
> several other packages are not released/published to maven central. It looks
> like these packages are removed from the official repo of Spark.
>
> I found the replacement git repos for these missing packages at
> https://github.com/spark-packages. This seems to have been created by some
> of the Spark committers.
>
> However, the repos haven't been active since last 5 months and none of the
> versions are released/published.
>
> Is https://github.com/spark-packages supposed to be the new official place
> for these missing streaming packages ?
>
> If so, how can we get someone to release and publish new versions officially
> ?
>
> I would like to help in any way possible to get these packages released and
> published.
>
> Thanks,
> --
> Kiran Chitturi
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Using sparse vector leads to array out of bounds exception

2016-08-03 Thread Sean Owen
Yeah, that's libsvm format, which is 1-indexed.

On Wed, Aug 3, 2016 at 12:45 PM, Tony Lane <tonylane@gmail.com> wrote:
> I guess the setup of the model and usage of the vector got to me.
> Setup takes positions 1, 2, 3 - like this in the build example - "1:0.0
> 2:0.0 3:0.0"
> I thought I needed to follow the same numbering while creating the vector too.
>
> thanks a bunch
>
>
> On Thu, Aug 4, 2016 at 12:39 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> You mean "new int[] {0,1,2}" because vectors are 0-indexed.
>>
>> On Wed, Aug 3, 2016 at 11:52 AM, Tony Lane <tonylane@gmail.com> wrote:
>> > Hi Sean,
>> >
>> > I did not understand,
>> > I created a KMeansModel with 3 dimensions and then I am calling predict
>> > method on this model with a 3 dimension vector ?
> I am not sure what is wrong with this approach. Am I missing a point?
>> >
>> > Tony
>> >
>> > On Wed, Aug 3, 2016 at 11:22 PM, Sean Owen <so...@cloudera.com> wrote:
>> >>
>> >> You declare that the vector has 3 dimensions, but then refer to its
>> >> 4th dimension (at index 3). That is the error.
>> >>
>> >> On Wed, Aug 3, 2016 at 10:43 AM, Tony Lane <tonylane@gmail.com>
>> >> wrote:
>> >> > I am using the following vector definition in java
>> >> >
>> >> > Vectors.sparse(3, new int[] { 1, 2, 3 }, new double[] { 1.1, 1.1, 1.1
>> >> > }))
>> >> >
>> >> > However when I run the predict method on this vector it leads to
>> >> >
>> >> > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException:
>> >> > 3
>> >> > at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:143)
>> >> > at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:115)
>> >> > at
>> >> >
>> >> >
>> >> > org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:298)
>> >> > at
>> >> >
>> >> >
>> >> > org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:606)
>> >> > at
>> >> >
>> >> >
>> >> > org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:580)
>> >> > at
>> >> >
>> >> >
>> >> > org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:574)
>> >> > at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
>> >> > at
>> >> >
>> >> > org.apache.spark.mllib.clustering.KMeans$.findClosest(KMeans.scala:574)
>> >> > at
>> >> >
>> >> >
>> >> > org.apache.spark.mllib.clustering.KMeansModel.predict(KMeansModel.scala:59)
>> >> > at
>> >> > org.apache.spark.ml.clustering.KMeansModel.predict(KMeans.scala:130)
>> >
>> >
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Using sparse vector leads to array out of bounds exception

2016-08-03 Thread Sean Owen
You mean "new int[] {0,1,2}" because vectors are 0-indexed.

On Wed, Aug 3, 2016 at 11:52 AM, Tony Lane <tonylane@gmail.com> wrote:
> Hi Sean,
>
> I did not understand,
> I created a KMeansModel with 3 dimensions and then I am calling predict
> method on this model with a 3 dimension vector ?
> I am not sure what is wrong with this approach. Am I missing a point?
>
> Tony
>
> On Wed, Aug 3, 2016 at 11:22 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> You declare that the vector has 3 dimensions, but then refer to its
>> 4th dimension (at index 3). That is the error.
>>
>> On Wed, Aug 3, 2016 at 10:43 AM, Tony Lane <tonylane@gmail.com> wrote:
>> > I am using the following vector definition in java
>> >
>> > Vectors.sparse(3, new int[] { 1, 2, 3 }, new double[] { 1.1, 1.1, 1.1
>> > }))
>> >
>> > However when I run the predict method on this vector it leads to
>> >
>> > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 3
>> > at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:143)
>> > at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:115)
>> > at
>> >
>> > org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:298)
>> > at
>> >
>> > org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:606)
>> > at
>> >
>> > org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:580)
>> > at
>> >
>> > org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:574)
>> > at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
>> > at
>> > org.apache.spark.mllib.clustering.KMeans$.findClosest(KMeans.scala:574)
>> > at
>> >
>> > org.apache.spark.mllib.clustering.KMeansModel.predict(KMeansModel.scala:59)
>> > at org.apache.spark.ml.clustering.KMeansModel.predict(KMeans.scala:130)
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: java.net.URISyntaxException: Relative path in absolute URI:

2016-08-03 Thread Sean Owen
file: "absolute directory"
does not sound like a valid URI

On Wed, Aug 3, 2016 at 11:05 AM, Flavio  wrote:
> Hello everyone,
>
> I am trying to run a very simple example but unfortunately I am stuck on the
> following exception:
>
> Exception in thread "main" java.lang.IllegalArgumentException:
> java.net.URISyntaxException: Relative path in absolute URI: file: "absolute
> directory"
>
> I was wondering if anyone got this exception trying to run the examples on
> the Spark git repo; the code I am actually trying to run is the following:
>
>
> //$example on$
> import org.apache.spark.ml.Pipeline;
> import org.apache.spark.ml.PipelineModel;
> import org.apache.spark.ml.PipelineStage;
> import org.apache.spark.ml.evaluation.RegressionEvaluator;
> import org.apache.spark.ml.feature.VectorIndexer;
> import org.apache.spark.ml.feature.VectorIndexerModel;
> import org.apache.spark.ml.regression.RandomForestRegressionModel;
> import org.apache.spark.ml.regression.RandomForestRegressor;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> //$example off$
>
> public class JavaRandomForestRegressorExample {
> public static void main(String[] args) {
> System.setProperty("hadoop.home.dir", "C:\\winutils");
>
> SparkSession spark = SparkSession
> .builder()
> .master("local[*]")
> .appName("JavaRandomForestRegressorExample")
> .getOrCreate();
>
> // $example on$
> // Load and parse the data file, converting it to a DataFrame.
> Dataset<Row> data =
> spark.read().format("libsvm").load("C:\\data\\sample_libsvm_data.txt");
>
> // Automatically identify categorical features, and index them.
> // Set maxCategories so features with > 4 distinct values are treated as
> // continuous.
> VectorIndexerModel featureIndexer = new
> VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures")
> .setMaxCategories(4).fit(data);
>
> // Split the data into training and test sets (30% held out for testing)
> Dataset<Row>[] splits = data.randomSplit(new double[] { 0.7, 0.3 });
> Dataset<Row> trainingData = splits[0];
> Dataset<Row> testData = splits[1];
>
> // Train a RandomForest model.
> RandomForestRegressor rf = new
> RandomForestRegressor().setLabelCol("label").setFeaturesCol("indexedFeatures");
>
> // Chain indexer and forest in a Pipeline
> Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] { featureIndexer, rf });
>
> // Train model. This also runs the indexer.
> PipelineModel model = pipeline.fit(trainingData);
>
> // Make predictions.
> Dataset<Row> predictions = model.transform(testData);
>
> // Select example rows to display.
> predictions.select("prediction", "label", "features").show(5);
>
> // Select (prediction, true label) and compute test error
> RegressionEvaluator evaluator = new
> RegressionEvaluator().setLabelCol("label").setPredictionCol("prediction")
> .setMetricName("rmse");
> double rmse = evaluator.evaluate(predictions);
> System.out.println("Root Mean Squared Error (RMSE) on test 
> data = " +
> rmse);
>
> RandomForestRegressionModel rfModel = (RandomForestRegressionModel) (model.stages()[1]);
> System.out.println("Learned regression forest model:\n" +
> rfModel.toDebugString());
> // $example off$
>
> spark.stop();
> }
> }
>
>
> Thanks to everyone for reading/answering!
>
> Flavio
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/java-net-URISyntaxException-Relative-path-in-absolute-URI-tp27466.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


