Re: IntelliJ IDEA 14 env setup; NoClassDefFoundError when running examples
Finally it works. @Sean, I'm trying to set up the environment in the IDE so I can step into Spark's code -- that will help me understand Spark's internal mechanisms. @Ted, thanks. I'm using Maven, not SBT, but thanks for the suggestion anyway.

For others who might be interested: I chose the bigtop-dist profile, so that under spark-assembly Maven builds a fat jar which includes all the built artifacts. I then added that fat jar as a dependency of the examples module, with runtime scope. After that I found I could debug and step through the code. This is perhaps not the formal way, but at least it works for me.

Regards,
Ya-Feng

On Sun, Feb 1, 2015 at 6:44 PM, Sean Owen so...@cloudera.com wrote:

How do you mean you run LogQuery? You would run these using the run-example script rather than in IntelliJ.

On Sun, Feb 1, 2015 at 4:01 AM, Yafeng Guo daniel.yafeng@gmail.com wrote:

Hi,

I'm setting up a dev environment with IntelliJ IDEA 14. I selected the profiles scala-2.10, maven-3, hadoop-2.4, hive and hive-0.13.1. The compilation passed, but when I try to run LogQuery in examples I hit the issue below:

Connected to the target VM, address: '127.0.0.1:37182', transport: 'socket'
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkConf
    at org.apache.spark.examples.LogQuery$.main(LogQuery.scala:46)
    at org.apache.spark.examples.LogQuery.main(LogQuery.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 2 more
Disconnected from the target VM, address: '127.0.0.1:37182', transport: 'socket'

Has anyone met a similar issue before? Thanks a lot.

Regards,
Ya-Feng
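For anyone reproducing Ya-Feng's workaround, here is a rough sketch of the two steps. The mvn flags, artifactId and version below are illustrative assumptions and should be matched against your own checkout:

    # Step 1 (shell): build the assembly fat jar with the bigtop-dist profile
    mvn -Pbigtop-dist -DskipTests package

    <!-- Step 2: in examples/pom.xml, depend on the assembly at runtime.
         artifactId/version here are illustrative, not canonical. -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-assembly_2.10</artifactId>
      <version>1.3.0-SNAPSHOT</version>
      <scope>runtime</scope>
    </dependency>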
Re: renaming SchemaRDD -> DataFrame
It is true that you can persist SchemaRDDs / DataFrames to disk via Parquet, but a lot of time and efficiency is lost. The in-memory columnar cached representation is completely different from the Parquet file format, and I believe there has to be a translation into Rows (ultimately Spark SQL traverses Rows -- even InMemoryColumnarTableScan has to convert the columns back into Rows for row-based processing). Traditional data frames, on the other hand, process in a columnar fashion. Columnar storage is good, but nowhere near as good as columnar processing.

Another issue, which I don't know if it is solved yet, is that it is difficult for Tachyon to efficiently cache Parquet files without understanding the file format itself. I gave a talk at last year's Spark Summit on this topic.

I'm working on efforts to change this, however. Shoot me an email at velvia at gmail if you're interested in joining forces.

On Thu, Jan 29, 2015 at 1:59 PM, Cheng Lian lian.cs@gmail.com wrote:

Yes, when a DataFrame is cached in memory, it's stored in an efficient columnar format. And you can also easily persist it on disk using Parquet, which is also columnar.

Cheng

On 1/29/15 1:24 PM, Koert Kuipers wrote:

To me the word DataFrame does come with certain expectations. One of them is that the data is stored column-wise. In R, data.frame internally uses a list of sequences, I think, but since lists can have labels it's more like a SortedMap[String, Array[_]]. This makes certain operations very cheap (such as adding a column). In Spark the closest thing would be a data structure where, per partition, the data is also stored column-wise. Does Spark SQL already use something like that? Evan mentioned Spark SQL columnar compression, which sounds like it. Where can I find that? Thanks

On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan velvia.git...@gmail.com wrote:

+1. Having proper NA support is much cleaner than using null, at least the Java null.

On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks evan.spa...@gmail.com wrote:

You've got to be a little bit careful here. NA in systems like R or pandas may have special meaning that is distinct from null. See, e.g., http://www.r-bloggers.com/r-na-vs-null/

On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin r...@databricks.com wrote:

Isn't that just null in SQL?

On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan velvia.git...@gmail.com wrote:

I believe that most DataFrame implementations out there, like Pandas, support the idea of missing values / NA, and some support the idea of Not Meaningful as well. Does Row support anything like that? That is important for certain applications. I thought that Row worked by being a mutable object, but I haven't looked into the details in a while.

-Evan

On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin r...@databricks.com wrote:

It shouldn't change the data source API at all, because data sources create RDD[Row], and that gets converted into a DataFrame automatically (previously into a SchemaRDD). https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala

One thing that will break in the data source API in 1.3 is the location of types. Types were previously defined in sql.catalyst.types and have now moved to sql.types. After 1.3, sql.catalyst is hidden from users, and all public APIs have first-class classes/objects defined directly in sql.

On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan velvia.git...@gmail.com wrote:

Hey guys, how does this impact the data sources API? I was planning on using this for a project.
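To make the storage-versus-processing distinction above concrete, here is a minimal Scala sketch (the class and method names are mine, purely illustrative -- this is not Spark SQL's internal representation): a columnar engine can sum a column by scanning a single array, while a row-based engine must first re-assemble every row from the columns, which is exactly the translation cost being described.

    // Illustrative only -- not Spark SQL's actual classes.
    case class ColumnarPartition(columns: Map[String, Array[Double]]) {
      // Columnar processing: operate on one whole column at a time.
      def sumColumn(name: String): Double = columns(name).sum

      // Row-based processing over columnar storage: every row must first
      // be re-assembled from the columns -- the translation cost above.
      def toRows: Iterator[Map[String, Double]] = {
        val names   = columns.keys.toSeq
        val numRows = columns.values.headOption.map(_.length).getOrElse(0)
        (0 until numRows).iterator.map { i =>
          names.map(c => c -> columns(c)(i)).toMap
        }
      }
    }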
+1. Many things from Spark SQL / DataFrame are universally desirable and useful. By the way, one thing that prevents the columnar compression stuff in Spark SQL from being more useful is, at least from previous talks with Reynold and Michael et al., that the format was not designed for persistence. I have a new project that aims to change that: a zero-serialisation, high-performance binary vector library, designed from the outset to be persistent-storage friendly. Maybe one day it can replace the Spark SQL columnar compression. Michael told me this would be a lot of work and would recreate parts of Parquet, but I think it's worth it. LMK if you'd like more details.

-Evan

On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin r...@databricks.com wrote:

Alright, I have merged the patch ( https://github.com/apache/spark/pull/4173 ) since I don't see any strong opinions against it (as a matter of fact most were for it). We can still change it if somebody lays out a strong argument.

On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

The type alias means your methods can specify either type and they will work. It's just another name for the same type. But Scaladocs and such will show DataFrame as the type.

Matei

On Jan 27,
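For readers unfamiliar with the aliasing trick Matei describes, a minimal self-contained Scala sketch of the mechanics (names are stand-ins; I believe the merged patch aliases the old name SchemaRDD to the new DataFrame class, which is the direction shown here):

    // Minimal sketch: a type alias makes the old and new names interchangeable.
    object AliasDemo {
      class DataFrame                 // stand-in for the real class
      type SchemaRDD = DataFrame      // the old name, bound to the same type

      def oldApi(rdd: SchemaRDD): String = "old signature still compiles"
      def newApi(df: DataFrame): String  = "new signature"

      def main(args: Array[String]): Unit = {
        val df = new DataFrame
        println(oldApi(df))           // the same value satisfies both
        println(newApi(df))
      }
    }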
Re: IntelliJ IDEA 14 env setup; NoClassDefFoundError when running examples
How do you mean you run LogQuery? You would run these using the run-example script rather than in IntelliJ.

On Sun, Feb 1, 2015 at 4:01 AM, Yafeng Guo daniel.yafeng@gmail.com wrote:

Hi,

I'm setting up a dev environment with IntelliJ IDEA 14. I selected the profiles scala-2.10, maven-3, hadoop-2.4, hive and hive-0.13.1. The compilation passed, but when I try to run LogQuery in examples I hit the issue below:

Connected to the target VM, address: '127.0.0.1:37182', transport: 'socket'
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkConf
    at org.apache.spark.examples.LogQuery$.main(LogQuery.scala:46)
    at org.apache.spark.examples.LogQuery.main(LogQuery.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 2 more
Disconnected from the target VM, address: '127.0.0.1:37182', transport: 'socket'

Has anyone met a similar issue before? Thanks a lot.

Regards,
Ya-Feng
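For reference, this is the script Sean mentions; assuming you are at the root of a built Spark source checkout, the invocation is simply:

    ./bin/run-example LogQuery

The script resolves the example class and the assembly jar for you, which is exactly what an IDE run configuration does not do.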
Caching tables at column level
I have been working a lot recently with denormalised tables with lots of columns, nearly 600. We are using this form to avoid joins. I have tried to use CACHE TABLE with this data, but it proves too expensive, as it seems to try to cache all the data in the table.

For data sets such as the one I am using, you find that certain columns are hot, referenced frequently in queries, while others are used very infrequently. It would therefore be great if caches could be column-based. I realise that this may not be optimal for all use cases, but I think it could be quite a common need. Has something like this been considered?

Thanks
Mick
Re: Any interest in 'weighting' VectorTransformer which does component-wise scaling?
I've added support for sparse vectors and created a HadamardTF for the pipeline; please take a look at my branch: https://github.com/ogeagla/spark/compare/spark-mllib-weighting

Thanks!
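For readers who haven't followed the branch, the operation itself is just component-wise (Hadamard) multiplication by a weight vector. A rough, self-contained Scala sketch of the idea -- the object and function names are mine, not the API in the branch above:

    import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}

    object HadamardSketch {
      // Component-wise (Hadamard) scaling: out(i) = v(i) * weights(i).
      def hadamardScale(v: Vector, weights: Array[Double]): Vector = {
        require(weights.length == v.size, "weights must match the vector dimension")
        v match {
          case dv: DenseVector =>
            Vectors.dense(dv.values.zip(weights).map { case (x, w) => x * w })
          case sv: SparseVector =>
            // Zeros stay zero, so only stored values need scaling;
            // sparsity is preserved.
            val scaled = sv.indices.zip(sv.values).map { case (i, x) => x * weights(i) }
            Vectors.sparse(sv.size, sv.indices, scaled)
        }
      }
    }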
Re: Caching tables at column level
It's not completely transparent, but you can do something like the following today:

CACHE TABLE hotData AS SELECT columns, I, care, about FROM fullTable

On Sun, Feb 1, 2015 at 3:03 AM, Mick Davies michael.belldav...@gmail.com wrote:

I have been working a lot recently with denormalised tables with lots of columns, nearly 600. We are using this form to avoid joins. I have tried to use CACHE TABLE with this data, but it proves too expensive, as it seems to try to cache all the data in the table.

For data sets such as the one I am using, you find that certain columns are hot, referenced frequently in queries, while others are used very infrequently. It would therefore be great if caches could be column-based. I realise that this may not be optimal for all use cases, but I think it could be quite a common need. Has something like this been considered?

Thanks
Mick
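In code (1.2-era API), the same trick looks roughly like this. The table and column names are made up for illustration, and it assumes a spark-shell session where sc is predefined and fullTable is already registered:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Register a projection of just the hot columns, then cache only that.
    val hot = sqlContext.sql("SELECT user_id, event_time, status FROM fullTable")
    hot.registerTempTable("hotData")
    sqlContext.cacheTable("hotData")  // caches three columns instead of ~600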
Re: Custom Cluster Managers / Standalone Recovery Mode in Spark
For the specific question of supplementing standalone mode with a custom leader election protocol, this was actually already committed in master and will be available in Spark 1.3: https://github.com/apache/spark/pull/771/files

You can set spark.deploy.recoveryMode = CUSTOM and point spark.deploy.recoveryMode.factory at a class which implements StandaloneRecoveryModeFactory. See the current implementations, FileSystemRecoveryModeFactory and ZooKeeperRecoveryModeFactory. I will update the JIRA you linked to be more current.

On Sat, Jan 31, 2015 at 12:55 AM, Anjana Fernando laferna...@gmail.com wrote:

Hi everyone,

I've been experimenting with Spark and am somewhat of a newbie. I was wondering whether there is any way I can use a custom cluster manager implementation with Spark. As I understand it, the built-in modes currently supported are standalone, Mesos and YARN. My requirement is basically a simple clustering solution with high availability of the master. I don't want to use a separate ZooKeeper cluster, since that would complicate my deployment; rather, I would like to use something like Hazelcast, which has a peer-to-peer cluster coordination implementation. I found that there is already this JIRA [1], which requests a custom persistence engine, I guess for storing state information. So basically, what I want to do is use Hazelcast for leader election, to make an existing node the master, and to look up the state information from the distributed memory. I'd appreciate any help on how to achieve this, and if it is useful for a wider audience, hopefully I can contribute it back to the project.

[1] https://issues.apache.org/jira/browse/SPARK-1180

Cheers,
Anjana.
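For anyone wiring this up, the configuration side is just the two properties Aaron names. A sketch with a hypothetical factory class (note these must be visible to the standalone Master's configuration, not just the application's):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.deploy.recoveryMode", "CUSTOM")
      // Hypothetical class name; it must implement StandaloneRecoveryModeFactory,
      // supplying a persistence engine and a leader election agent.
      .set("spark.deploy.recoveryMode.factory",
           "com.example.HazelcastRecoveryModeFactory")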
Word2Vec IndexedRDD
1. Is IndexedRDD planned for 1.3? https://issues.apache.org/jira/browse/SPARK-2365

2. Once IndexedRDD is in, is it planned to convert Word2VecModel to it from its current Map[String, Array[Float]]? https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L425
Re: Custom Cluster Managers / Standalone Recovery Mode in Spark
Hi guys,

That's great to hear that this is available in Spark 1.3! I will play around with the feature and let you know how the Hazelcast integration goes. Also, may I know the tentative release date for Spark 1.3?

Cheers,
Anjana.

On Mon, Feb 2, 2015 at 3:07 AM, Aaron Davidson ilike...@gmail.com wrote:

For the specific question of supplementing standalone mode with a custom leader election protocol, this was actually already committed in master and will be available in Spark 1.3: https://github.com/apache/spark/pull/771/files

You can set spark.deploy.recoveryMode = CUSTOM and point spark.deploy.recoveryMode.factory at a class which implements StandaloneRecoveryModeFactory. See the current implementations, FileSystemRecoveryModeFactory and ZooKeeperRecoveryModeFactory. I will update the JIRA you linked to be more current.

On Sat, Jan 31, 2015 at 12:55 AM, Anjana Fernando laferna...@gmail.com wrote:

Hi everyone,

I've been experimenting with Spark and am somewhat of a newbie. I was wondering whether there is any way I can use a custom cluster manager implementation with Spark. As I understand it, the built-in modes currently supported are standalone, Mesos and YARN. My requirement is basically a simple clustering solution with high availability of the master. I don't want to use a separate ZooKeeper cluster, since that would complicate my deployment; rather, I would like to use something like Hazelcast, which has a peer-to-peer cluster coordination implementation. I found that there is already this JIRA [1], which requests a custom persistence engine, I guess for storing state information. So basically, what I want to do is use Hazelcast for leader election, to make an existing node the master, and to look up the state information from the distributed memory. I'd appreciate any help on how to achieve this, and if it is useful for a wider audience, hopefully I can contribute it back to the project.

[1] https://issues.apache.org/jira/browse/SPARK-1180

Cheers,
Anjana.