Re: IntelliJ IDEA 14 env setup; NoClassDefFoundError when running examples

2015-02-01 Thread Yafeng Guo
Finally it works.

@Sean, I'm trying to set up the environment in the IDE so I can step into
Spark code -- that will help me understand Spark's internal mechanisms.

@Ted, thanks. I'm using Maven, not SBT, but thanks for the suggestion
anyway.

For others who might be interested:

I chose the bigtop-dist profile so that under spark-assembly Maven builds a
fat jar containing all of the built artifacts. Then I added that fat jar as
a runtime-scoped dependency of the examples module.
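
Roughly, the extra dependency in examples/pom.xml looked something like the
following (the exact version and Scala suffix are from memory, so treat this
as a sketch rather than the exact snippet):

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-assembly_2.10</artifactId>
      <version>${project.version}</version>
      <scope>runtime</scope>
    </dependency>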

With that in place I can debug and step through the code.

This may not be the official way, but at least it works for me.

Regards,
Ya-Feng

On Sun, Feb 1, 2015 at 6:44 PM, Sean Owen so...@cloudera.com wrote:

 How do you mean you run LogQuery? You would run these using the
 run-example script rather than in IntelliJ.

 On Sun, Feb 1, 2015 at 4:01 AM, Yafeng Guo daniel.yafeng@gmail.com
 wrote:
  Hi,
 
  I'm setting up a dev environment with IntelliJ IDEA 14. I selected the
  profiles scala-2.10, maven-3, hadoop-2.4, hive and hive-0.13.1. The
  compilation passed, but when I try to run LogQuery from the examples
  module I hit the issue below:
 
  Connected to the target VM, address: '127.0.0.1:37182', transport:
 'socket'
  Exception in thread main java.lang.NoClassDefFoundError:
  org/apache/spark/SparkConf
  at org.apache.spark.examples.LogQuery$.main(LogQuery.scala:46)
  at org.apache.spark.examples.LogQuery.main(LogQuery.scala)
  Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
  at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
  ... 2 more
  Disconnected from the target VM, address: '127.0.0.1:37182', transport:
  'socket'
 
  Has anyone hit a similar issue before? Thanks a lot
 
  Regards,
  Ya-Feng



Re: renaming SchemaRDD -> DataFrame

2015-02-01 Thread Evan Chan
It is true that you can persist SchemaRDDs / DataFrames to disk via
Parquet, but a lot of time is lost to inefficiencies.   The in-memory
columnar cached representation is completely different from the
Parquet file format, and I believe there has to be a translation into
a Row (because ultimately Spark SQL traverses Rows -- even the
InMemoryColumnarTableScan has to convert the columns back into Rows
for row-based processing).   On the other hand, traditional data
frames process in a columnar fashion.   Columnar storage is good, but
nowhere near as good as columnar processing.
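
As a toy illustration of the distinction (just a sketch, not Spark SQL's
actual code path): even when the data sits in column arrays, row-based
processing still materializes a row per record, while columnar processing
scans the arrays directly.

    // Toy column store: two columns kept as plain arrays.
    val price: Array[Double] = Array(1.0, 2.0, 3.0)
    val qty: Array[Int] = Array(10, 20, 30)

    // Row-based consumption: build a tuple (a row) per record, then read a field.
    val totalRowWise = (0 until price.length)
      .map(i => (price(i), qty(i))) // materialize a row
      .map(_._1)                    // pull the column back out of it
      .sum

    // Columnar consumption: operate on the column array directly, no row objects.
    val totalColumnWise = price.sum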

Another issue, which I don't know whether it has been solved yet, is that
it is difficult for Tachyon to efficiently cache Parquet files without
understanding the file format itself.

I gave a talk at last year's Spark Summit on this topic.

I'm working on efforts to change this, however.  Shoot me an email at
velvia at gmail if you're interested in joining forces.

On Thu, Jan 29, 2015 at 1:59 PM, Cheng Lian lian.cs@gmail.com wrote:
 Yes, when a DataFrame is cached in memory, it's stored in an efficient
 columnar format. And you can also easily persist it on disk using Parquet,
 which is also columnar.

 Cheng


 On 1/29/15 1:24 PM, Koert Kuipers wrote:

 To me the word DataFrame does come with certain expectations. One of them
 is that the data is stored columnar. In R, data.frame internally uses a
 list of sequences I think, but since lists can have labels it's more like a
 SortedMap[String, Array[_]]. This makes certain operations very cheap, such
 as adding a column.
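
 A rough sketch of that mental model (the names here are made up, not an
 actual Spark or R API):

     // A toy column-oriented frame: column name -> column of values.
     import scala.collection.immutable.SortedMap
     type ToyFrame = SortedMap[String, Array[_]]

     val df: ToyFrame = SortedMap(
       "age"  -> Array(31, 42),
       "name" -> Array("ann", "bob"))

     // Adding a column is cheap: one new map entry, existing columns untouched.
     val withCity: ToyFrame = df + ("city" -> Array("NYC", "SF"))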

 In Spark the closest thing would be a data structure where, per partition,
 the data is also stored columnar. Does Spark SQL already use something like
 that? Evan mentioned Spark SQL columnar compression, which sounds like it.
 Where can I find that?

 thanks

 On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan velvia.git...@gmail.com
 wrote:

 +1 having proper NA support is much cleaner than using null, at
 least the Java null.

 On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 You've got to be a little bit careful here. NA in systems like R or pandas
 may have special meaning that is distinct from null.

 See, e.g. http://www.r-bloggers.com/r-na-vs-null/



 On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin r...@databricks.com

 wrote:

 Isn't that just null in SQL?

 On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan velvia.git...@gmail.com
 wrote:

 I believe that most DataFrame implementations out there, like Pandas,
 support the idea of missing values / NA, and some support the idea of
 Not Meaningful as well.

 Does Row support anything like that?  That is important for certain
 applications.  I thought that Row worked by being a mutable object,
 but haven't looked into the details in a while.

 -Evan

 On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin r...@databricks.com
 wrote:

 It shouldn't change the data source API at all, because data sources
 create RDD[Row], and that gets converted into a DataFrame automatically
 (previously to SchemaRDD).




 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala

 One thing that will break the data source API in 1.3 is the location of
 types. Types were previously defined in sql.catalyst.types, and have now
 moved to sql.types. After 1.3, sql.catalyst is hidden from users, and all
 public APIs have first-class classes/objects defined in sql directly.
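
 In user code that is roughly the following import change (a sketch showing
 a few common types, not an exhaustive list):

     // Before 1.3: types lived under the catalyst package.
     // import org.apache.spark.sql.catalyst.types.{StructType, StructField, StringType}

     // From 1.3 on: the public home is org.apache.spark.sql.types.
     import org.apache.spark.sql.types.{StructType, StructField, StringType}

     val schema = StructType(Seq(StructField("name", StringType, nullable = true)))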



 On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan velvia.git...@gmail.com

 wrote:

 Hey guys,

 How does this impact the data sources API?  I was planning on using
 this for a project.

 +1 that many things from spark-sql / DataFrame are universally
 desirable and useful.

 By the way, one thing that prevents the columnar compression stuff in
 Spark SQL from being more useful is, at least from previous talks with
 Reynold and Michael et al., that the format was not designed for
 persistence.

 I have a new project that aims to change that.  It is a zero-serialisation,
 high-performance binary vector library, designed from the outset to be
 persistent-storage friendly.  Maybe one day it can replace the Spark SQL
 columnar compression.

 Michael told me this would be a lot of work, and recreates parts of
 Parquet, but I think it's worth it.  LMK if you'd like more details.

 -Evan

 On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin r...@databricks.com

 wrote:

 Alright I have merged the patch (
 https://github.com/apache/spark/pull/4173 ) since I don't see any strong
 opinions against it (as a matter of fact most were for it). We can still
 change it if somebody lays out a strong argument.

 On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
 matei.zaha...@gmail.com
 wrote:

 The type alias means your methods can specify either type and they will
 work. It's just another name for the same type. But Scaladocs and such will
 show DataFrame as the type.
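
 For illustration, the mechanism is along these lines (just a sketch of how
 a type alias behaves, not the actual Spark source):

     class DataFrame                      // the new name
     type SchemaRDD = DataFrame           // alias: a second name for the same type

     def f(df: DataFrame): Unit = ()
     def g(rdd: SchemaRDD): Unit = ()

     val df = new DataFrame
     f(df); g(df)                         // both compile, since it is one type with two names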

 Matei

 On Jan 27, 

Re: IntelliJ IDEA 14 env setup; NoClassDefFoundError when running examples

2015-02-01 Thread Sean Owen
How do you mean you run LogQuery? You would run these using the
run-example script rather than in IntelliJ.
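From the root of a Spark checkout that would be something along the lines of
running ./bin/run-example LogQuery; the script takes care of putting the
examples assembly and Spark itself on the classpath.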

On Sun, Feb 1, 2015 at 4:01 AM, Yafeng Guo daniel.yafeng@gmail.com wrote:
 Hi,

 I'm setting up a dev environment with IntelliJ IDEA 14. I selected the profiles
 scala-2.10, maven-3, hadoop-2.4, hive and hive-0.13.1. The compilation passed,
 but when I try to run LogQuery from the examples module I hit the issue below:

 Connected to the target VM, address: '127.0.0.1:37182', transport: 'socket'
 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/spark/SparkConf
 at org.apache.spark.examples.LogQuery$.main(LogQuery.scala:46)
 at org.apache.spark.examples.LogQuery.main(LogQuery.scala)
 Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 ... 2 more
 Disconnected from the target VM, address: '127.0.0.1:37182', transport:
 'socket'

 Has anyone hit a similar issue before? Thanks a lot

 Regards,
 Ya-Feng




Caching tables at column level

2015-02-01 Thread Mick Davies
I have been working a lot recently with denormalised tables with lots of
columns, nearly 600. We are using this form to avoid joins. 

I have tried to use cache table with this data, but it proves too expensive
as it seems to try to cache all the data in the table.

For data sets such as the one I am using, you find that certain columns
will be hot and referenced frequently in queries, while others will be used
very infrequently.

Therefore it would be great if caches could be column based. I realise that
this may not be optimal for all use cases, but I think it could be quite a
common need.  Has something like this been considered?

Thanks Mick






Re: Any interest in 'weighting' VectorTransformer which does component-wise scaling?

2015-02-01 Thread Octavian Geagla
I've added support for sparse vectors and created HadamardTF for the
pipeline. Please take a look at my branch:
https://github.com/ogeagla/spark/compare/spark-mllib-weighting

Thanks!






Re: Caching tables at column level

2015-02-01 Thread Michael Armbrust
It's not completely transparent, but you can do something like the following
today:

CACHE TABLE hotData AS SELECT columns, I, care, about FROM fullTable
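
Or roughly the same thing from Scala (a sketch, assuming a SQLContext is
already in scope as sqlContext; the table and column names are the ones from
the example above):

    // Register a narrow projection of the wide table and cache just that.
    sqlContext.sql("SELECT columns, I, care, about FROM fullTable")
      .registerTempTable("hotData")
    sqlContext.cacheTable("hotData")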

On Sun, Feb 1, 2015 at 3:03 AM, Mick Davies michael.belldav...@gmail.com
wrote:

 I have been working a lot recently with denormalised tables with lots of
 columns, nearly 600. We are using this form to avoid joins.

 I have tried to use cache table with this data, but it proves too expensive
 as it seems to try to cache all the data in the table.

 For data sets such as the one I am using, you find that certain columns
 will be hot and referenced frequently in queries, while others will be used
 very infrequently.

 Therefore it would be great if caches could be column based. I realise that
 this may not be optimal for all use cases, but I think it could be quite a
 common need.  Has something like this been considered?

 Thanks Mick







Re: Custom Cluster Managers / Standalone Recovery Mode in Spark

2015-02-01 Thread Aaron Davidson
For the specific question of supplementing Standalone Mode with a custom
leader election protocol, this was actually already committed in master and
will be available in Spark 1.3:

https://github.com/apache/spark/pull/771/files

You can set spark.deploy.recoveryMode = CUSTOM
and point spark.deploy.recoveryMode.factory at a class which
implements StandaloneRecoveryModeFactory. See the existing implementations,
FileSystemRecoveryModeFactory and ZooKeeperRecoveryModeFactory.
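
In terms of configuration that would look something like the sketch below
(the factory class name is just a hypothetical placeholder):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.deploy.recoveryMode", "CUSTOM")
      .set("spark.deploy.recoveryMode.factory",
           "com.example.HazelcastRecoveryModeFactory")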

I will update the JIRA you linked to be more current.

On Sat, Jan 31, 2015 at 12:55 AM, Anjana Fernando laferna...@gmail.com
wrote:

 Hi everyone,

 I've been experimenting with Spark and am somewhat of a newbie. I was
 wondering if there is any way that I can use a custom cluster manager
 implementation with Spark. Basically, as I understand it, the built-in
 modes supported at the moment are standalone, Mesos and YARN. My
 requirement is basically a simple clustering solution with high
 availability of the master. I don't want to use a separate ZooKeeper
 cluster, since this would complicate my deployment, but rather I would
 like to use something like Hazelcast, which has a peer-to-peer cluster
 coordination implementation.

 I found that there is already this JIRA [1], which requests a custom
 persistence engine, I guess for storing state information. So basically,
 what I would want to do is use Hazelcast for leader election, to make an
 existing node the master, and to look up the state information from the
 distributed memory. I'd appreciate any help on how to achieve this. And if
 it is useful for a wider audience, hopefully I can contribute this back to
 the project.

 [1] https://issues.apache.org/jira/browse/SPARK-1180

 Cheers,
 Anjana.



Word2Vec IndexedRDD

2015-02-01 Thread Michael Malak
1. Is IndexedRDD planned for 1.3? 
https://issues.apache.org/jira/browse/SPARK-2365

2. Once IndexedRDD is in, is it planned to convert Word2VecModel to it from its 
current Map[String,Array[Float]]? 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L425




Re: Custom Cluster Managers / Standalone Recovery Mode in Spark

2015-02-01 Thread Anjana Fernando
Hi guys,

That's great to hear that this is available in Spark 1.3! I will play
around with this feature and let you know the results of integrating
Hazelcast. Also, may I know the tentative release date for Spark 1.3?

Cheers,
Anjana.

On Mon, Feb 2, 2015 at 3:07 AM, Aaron Davidson ilike...@gmail.com wrote:

 For the specific question of supplementing Standalone Mode with a custom
 leader election protocol, this was actually already committed in master and
 will be available in Spark 1.3:

 https://github.com/apache/spark/pull/771/files

 You can set spark.deploy.recoveryMode = CUSTOM
 and point spark.deploy.recoveryMode.factory at a class which
 implements StandaloneRecoveryModeFactory. See the existing implementations,
 FileSystemRecoveryModeFactory and ZooKeeperRecoveryModeFactory.

 I will update the JIRA you linked to be more current.

 On Sat, Jan 31, 2015 at 12:55 AM, Anjana Fernando laferna...@gmail.com
 wrote:

 Hi everyone,

 I've been experimenting with Spark and am somewhat of a newbie. I was
 wondering if there is any way that I can use a custom cluster manager
 implementation with Spark. Basically, as I understand it, the built-in
 modes supported at the moment are standalone, Mesos and YARN. My
 requirement is basically a simple clustering solution with high
 availability of the master. I don't want to use a separate ZooKeeper
 cluster, since this would complicate my deployment, but rather I would
 like to use something like Hazelcast, which has a peer-to-peer cluster
 coordination implementation.

 I found that there is already this JIRA [1], which requests a custom
 persistence engine, I guess for storing state information. So basically,
 what I would want to do is use Hazelcast for leader election, to make an
 existing node the master, and to look up the state information from the
 distributed memory. I'd appreciate any help on how to achieve this. And if
 it is useful for a wider audience, hopefully I can contribute this back to
 the project.

 [1] https://issues.apache.org/jira/browse/SPARK-1180

 Cheers,
 Anjana.