Re: scalac crash when compiling DataTypeConversions.scala

2014-10-23 Thread Patrick Wendell
Hey Ryan, I've found that filing issues with the Scala/Typesafe JIRA is pretty helpful if the issue can be fully reproduced, and even sometimes helpful if it can't. You can file bugs here: https://issues.scala-lang.org/secure/Dashboard.jspa The Spark SQL code in particular is typically the

Re: About Memory usage in the Spark UI

2014-10-23 Thread Patrick Wendell
It shows the amount of memory used to store RDD blocks, which are created when you run .cache()/.persist() on an RDD. On Wed, Oct 22, 2014 at 10:07 PM, Haopu Wang hw...@qilinsoft.com wrote: Hi, please take a look at the attached screen-shot. I wonder what the Memory Used column means. I
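A minimal spark-shell sketch of where that number comes from (assuming an existing SparkContext sc; the input path is illustrative):

    import org.apache.spark.storage.StorageLevel
    val cached = sc.textFile("hdfs:///some/input").persist(StorageLevel.MEMORY_ONLY)
    cached.count()   // materializes the blocks; they then show up under Storage / "Memory Used"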

Re: Spark Hive Snappy Error

2014-10-23 Thread arthur.hk.c...@gmail.com
Hi, Removed export CLASSPATH=$HBASE_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar. It works, THANK YOU!! Regards Arthur On 23 Oct, 2014, at 1:00 pm, Shao, Saisai saisai.s...@intel.com wrote: Seems you just added the snappy library to your classpath: export

Is Spark streaming suitable for our architecture?

2014-10-23 Thread Albert Vila
Hi, I'm evaluating Spark Streaming to see if it fits to scale our current architecture. We are currently downloading and processing 6M documents per day from online and social media. We have a different workflow for each type of document, but some of the steps are keyword extraction, language

Dynamically loaded Spark-stream consumer

2014-10-23 Thread Jianshi Huang
I have a use case where I need to continuously ingest data from a Kafka stream. However, apart from ingestion (to HBase), I also need to compute some metrics (e.g. avg for the last minute, etc.). The problem is that it's very likely I'll continuously add more metrics and I don't want to restart my spark

How to access objects declared and initialized outside the call() method of JavaRDD

2014-10-23 Thread Localhost shell
Hey All, I am unable to access objects declared and initialized outside the call() method of JavaRDD. In the below code snippet, call() makes a fetch call to C* but since javaSparkContext is defined outside the call() method scope, the compiler gives a compilation error. stringRdd.foreach(new

RE: About Memory usage in the Spark UI

2014-10-23 Thread Haopu Wang
Patrick, thanks for the response. May I ask more questions? I'm running a Spark Streaming application which receives data from a socket and does some transformations. The event injection rate is too high, so the processing duration is larger than the batch interval. So I see Could not

Which is better? One spark app listening to 10 topics vs. 10 spark apps each listening to 1 topic

2014-10-23 Thread Jianshi Huang
The Kafka stream has 10 topics and the data rate is quite high (~ 100K/s per topic). Which configuration do you recommend? - 1 Spark app consuming all Kafka topics - 10 separate Spark apps, each consuming one topic Assuming they have the same resource pool. Cheers, -- Jianshi Huang LinkedIn:

NoClassDefFoundError on ThreadFactoryBuilder in Intellij

2014-10-23 Thread Stephen Boesch
After checking out master/head, the following error occurs when attempting to run any test in IntelliJ: Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/util/concurrent/ThreadFactoryBuilder at org.apache.spark.util.Utils$.<init>(Utils.scala:648) There appears to

Re: How to access objects declared and initialized outside the call() method of JavaRDD

2014-10-23 Thread Sean Owen
In Java, javaSparkContext would have to be declared final in order for it to be accessed inside an inner class like this. But this would still not work as the context is not serializable. You should rewrite this so you are not attempting to use the Spark context inside an RDD. On Thu, Oct 23,
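A rough Scala sketch of the kind of rewrite being suggested; loadReferenceData and the map lookup are illustrative stand-ins, not the poster's actual logic:

    // Precompute (or broadcast) whatever the closure needs on the driver,
    // instead of touching the SparkContext inside the RDD function.
    val lookup = sc.broadcast(loadReferenceData())   // hypothetical driver-side helper
    stringRdd.foreach { s =>
      val value = lookup.value.getOrElse(s, "")      // only serializable data reaches the executors
      // ... use `value` here
    }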

Re: Solving linear equations

2014-10-23 Thread Sean Owen
The 0 vector is a trivial solution. Is the data big, such that it can't be computed on one machine? if so I assume this system is over-determined. You can use a decomposition to find a least-squares solution, but the SVD is overkill and in any event distributed decompositions don't exist in the
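One hedged sketch of a least-squares solution without a distributed decomposition, assuming the number of unknowns is small enough to solve locally (Breeze for the local solve; the RDD of (features, target) rows is illustrative):

    import breeze.linalg.{DenseMatrix, DenseVector}
    // rows: RDD[(Array[Double], Double)] -- one (a_i, b_i) pair per equation
    val (ata, atb) = rows.map { case (a, b) =>
      val av = DenseVector(a)
      (av * av.t, av * b)                        // accumulate A^T A and A^T b
    }.reduce((x, y) => (x._1 + y._1, x._2 + y._2))
    val solution = ata \ atb                      // solve the (small) normal equations locally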

Re: Is Spark streaming suitable for our architecture?

2014-10-23 Thread Jayant Shekhar
Hi Albert, Have a couple of questions: - You mentioned near real-time. What exactly is your SLA for processing each document? - Which crawler are you using, and are you looking to bring Hadoop into your overall workflow? You might want to read up on how network traffic is

SparkSQL and columnar data

2014-10-23 Thread Marius Soutier
Hi guys, another question: what's the approach to working with column-oriented data, i.e. data with more than 1000 columns? Using Parquet for this should be fine, but how well does SparkSQL handle such a large number of columns? Is there a limit? Should we use standard Spark instead? Thanks for

Re: Multitenancy in Spark - within/across spark context

2014-10-23 Thread Jianshi Huang
Upvote for the multitenancy requirement. I'm also building a data analytics platform and there'll be multiple users running queries and computations simultaneously. One of the pain points is control of resource size. Users don't really know how many nodes they need; they always use as much as

Spark Cassandra Connector proper usage

2014-10-23 Thread Ashic Mahtab
I'm looking to use spark for some ETL, which will mostly consist of update statements (a column is a set, that'll be appended to, so a simple insert is likely not going to work). As such, it seems like issuing CQL queries to import the data is the best option. Using the Spark Cassandra

Re: Is Spark streaming suitable for our architecture?

2014-10-23 Thread Albert Vila
Hi Jayant, On 23 October 2014 11:14, Jayant Shekhar jay...@cloudera.com wrote: Hi Albert, Have a couple of questions: - You mentioned near real-time. What exactly is your SLA for processing each document? The lower, the better :). Right now it's between 30s - 5m, but I would like to

what's the best way to initialize an executor?

2014-10-23 Thread Darin McBeath
I have some code that I only need to be executed once per executor in my spark application. My current approach is to do something like the following: scala> xmlKeyPair.foreachPartition(i => XPathProcessor.init(ats, Namespaces/NamespaceContext)) So, if I understand correctly, the
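A minimal sketch of the idempotent-init object such a call would rely on (names mirror the snippet but are illustrative):

    object XPathProcessor {
      @volatile private var initialized = false
      def init(ats: Any, ctx: String): Unit = synchronized {
        if (!initialized) {
          // expensive one-time setup goes here
          initialized = true
        }
      }
    }
    // every partition calls init, but only the first call per executor JVM does real work
    xmlKeyPair.foreachPartition(_ => XPathProcessor.init(ats, "Namespaces/NamespaceContext"))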

RE: Spark Cassandra Connector proper usage

2014-10-23 Thread Ashic Mahtab
Hi Gerard, Thanks for the response. Here's the scenario: The target cassandra schema looks like this: create table foo ( id text primary key, bar int, things set<text> ) The source in question is a Sql Server source providing the necessary data. The source goes over the same id multiple

Aggregation Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:

2014-10-23 Thread arthur.hk.c...@gmail.com
Hi, I got $TreeNodeException, few questions: Q1) How should I do aggregation in Spark? Can I use aggregation directly in SQL? or Q2) Should I use SQL to load the data to form an RDD, then use Scala to do the aggregation? Regards Arthur MySQL (good one, without aggregation):

Re: Spark Streaming Applications

2014-10-23 Thread Saiph Kappa
What is the application about? I couldn't find any proper description regarding the purpose of KillrWeather (I mean, other than just integrating Spark with Cassandra). Do you know if the slides of that tutorial are available somewhere? Thanks! On Wed, Oct 22, 2014 at 6:58 PM, Sameer Farooqui

Re: Aggregation Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:

2014-10-23 Thread Yin Huai
Hello Arthur, You can do aggregations in SQL. How did you create LINEITEM? Thanks, Yin On Thu, Oct 23, 2014 at 8:54 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I got $TreeNodeException, few questions: Q1) How should I do aggregation in Spark? Can I use

Re: Spark Cassandra Connector proper usage

2014-10-23 Thread Gerard Maas
Hi Ashic, At the moment I see two options: 1) You could use the CassandraConnector object to execute your specialized query. The recommended pattern is to do that within an rdd.foreachPartition(...) in order to amortize DB connection setup over the number of elements in one partition. Something
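A rough sketch of option 1, assuming the DataStax spark-cassandra-connector; the keyspace, table, and row fields are illustrative:

    import com.datastax.spark.connector.cql.CassandraConnector

    val connector = CassandraConnector(sc.getConf)
    rdd.foreachPartition { rows =>
      connector.withSessionDo { session =>               // one session per partition
        rows.foreach { case (id, thing) =>
          session.execute(
            s"UPDATE ks.foo SET things = things + {'$thing'} WHERE id = '$id'")
        }
      }
    }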

RE: Spark Cassandra Connector proper usage

2014-10-23 Thread Ashic Mahtab
Hi Gerard, I've gone with option 1, and seems to be working well. Option 2 is also quite interesting. Thanks for your help in this. Regards, Ashic. From: gerard.m...@gmail.com Date: Thu, 23 Oct 2014 17:07:56 +0200 Subject: Re: Spark Cassandra Connector proper usage To: as...@live.com CC:

Re: what's the best way to initialize an executor?

2014-10-23 Thread Sean Owen
It sounds like your code already does its initialization at most once per JVM, and that's about as good as it gets. Each partition asks for init in a thread-safe way and the first request succeeds. On Thu, Oct 23, 2014 at 1:41 PM, Darin McBeath ddmcbe...@yahoo.com.invalid wrote: I have some code

Re: How to set hadoop native library path in spark-1.1

2014-10-23 Thread Christophe Préaud
Hi, Try the --driver-library-path option of spark-submit, e.g.: /opt/spark/bin/spark-submit --driver-library-path /opt/hadoop/lib/native (...) Regards, Christophe. On 21/10/2014 20:44, Pradeep Ch wrote: Hi all, Can anyone tell me how to set the native library path in Spark. Right now I am

Re: Multitenancy in Spark - within/across spark context

2014-10-23 Thread Marcelo Vanzin
You may want to take a look at https://issues.apache.org/jira/browse/SPARK-3174. On Thu, Oct 23, 2014 at 2:56 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Upvote for the multitenancy requirement. I'm also building a data analytics platform and there'll be multiple users running queries and

Re: Setting only master heap

2014-10-23 Thread Andrew Or
Yeah, as Sameer commented, there is unfortunately not an equivalent `SPARK_MASTER_MEMORY` that you can set. You can work around this by starting the master and the slaves separately with different settings of SPARK_DAEMON_MEMORY each time. AFAIK there haven't been any major changes in the
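A sketch of that workaround, assuming a standalone cluster started via the sbin scripts (values are illustrative):

    SPARK_DAEMON_MEMORY=512m ./sbin/start-master.sh     # master with a small heap
    SPARK_DAEMON_MEMORY=2g   ./sbin/start-slaves.sh     # workers with a larger daemon heap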

Re: Shuffle issues in the current master

2014-10-23 Thread Andrew Or
To add to Aaron's response, `spark.shuffle.consolidateFiles` only applies to hash-based shuffle, so you shouldn't have to set it for sort-based shuffle. And yes, since you changed neither `spark.shuffle.compress` nor `spark.shuffle.spill.compress` you can't possibly have run into what #2890 fixes.
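A small sketch of the settings under discussion (key names as in the Spark 1.1 configuration docs; values illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.manager", "sort")             // sort-based shuffle: consolidateFiles is irrelevant
      .set("spark.shuffle.consolidateFiles", "true")    // only matters for hash-based shuffle
      .set("spark.shuffle.compress", "true")            // left at its default in the reported case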

Re: How to access objects declared and initialized outside the call() method of JavaRDD

2014-10-23 Thread Localhost shell
Bang on, Sean. Before sending the issue mail, I was able to remove the compilation error by making it final, but then got Caused by: java.io.NotSerializableException: org.apache.spark.api.java.JavaSparkContext (as you mentioned). Now regarding your suggestion of changing the business logic, 1.

Re: How to access objects declared and initialized outside the call() method of JavaRDD

2014-10-23 Thread Jayant Shekhar
+1 to Sean. Is it possible to rewrite your code to not use the SparkContext in the RDD? Or why does javaFunctions() need the SparkContext? On Thu, Oct 23, 2014 at 10:53 AM, Localhost shell universal.localh...@gmail.com wrote: Bang on, Sean. Before sending the issue mail, I was able to remove the

Re: Aggregation Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:

2014-10-23 Thread arthur.hk.c...@gmail.com
Hi, My steps to create LINEITEM: $HADOOP_HOME/bin/hadoop fs -mkdir /tpch/lineitem $HADOOP_HOME/bin/hadoop fs -copyFromLocal lineitem.tbl /tpch/lineitem/ Create external table lineitem (L_ORDERKEY INT, L_PARTKEY INT, L_SUPPKEY INT, L_LINENUMBER INT, L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE,

Re: How to access objects declared and initialized outside the call() method of JavaRDD

2014-10-23 Thread Localhost shell
Hey Jayant, In my previous mail I mentioned a github gist https://gist.github.com/rssvihla/6577359860858ccb0b33 which does something very similar to what I want to do, but it uses Scala for Spark. Hence my question (reiterating

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-23 Thread Tathagata Das
Hey Gerard, This is a very good question! TL;DR: The performance should be the same, except in the case of shuffle-based operations where the number of reducers is not explicitly specified. Let me answer in more detail by dividing the set of DStream operations into three categories. 1. Map-like
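A tiny sketch of the two equivalent styles being compared (parse and the DStream name are illustrative):

    val a = dstream.map(record => parse(record))         // operate on the DStream directly
    val b = dstream.transform(rdd => rdd.map(parse))     // operate on each underlying RDD
    // for shuffle-based operations, specifying the number of reducers explicitly
    // (e.g. reduceByKey(_ + _, 32) inside transform) removes the difference TD describes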

Re: Spark Streaming Applications

2014-10-23 Thread Tathagata Das
Cc'ing Helena for more information on this. TD On Thu, Oct 23, 2014 at 6:30 AM, Saiph Kappa saiph.ka...@gmail.com wrote: What is the application about? I couldn't find any proper description regarding the purpose of killrweather ( I mean, other than just integrating Spark with Cassandra). Do

JavaHiveContext class not found error. Help!!

2014-10-23 Thread nitinkak001
I am trying to run the below Hive query on YARN. I am using Cloudera 5.1. What can I do to make this work? SELECT * FROM table_name DISTRIBUTE BY GEO_REGION, GEO_COUNTRY SORT BY IP_ADDRESS, COOKIE_ID; Below is the stack trace: Exception in thread Thread-4

Re: How to access objects declared and initialized outside the call() method of JavaRDD

2014-10-23 Thread lordjoe
What I have been doing is building a JavaSparkContext the first time it is needed and keeping it as a ThreadLocal - all my code uses SparkUtilities.getCurrentContext(). On a slave machine you build a new context and don't have to serialize it. The code is in a large project at
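A rough sketch of the holder pattern being described; the class and method names are illustrative, not the actual project code, and the master URL is assumed to come from spark.master / defaults:

    import org.apache.spark.SparkConf
    import org.apache.spark.api.java.JavaSparkContext

    object SparkUtilities {
      private val current = new ThreadLocal[JavaSparkContext]
      def getCurrentContext(): JavaSparkContext = {
        if (current.get() == null) {
          // built lazily, once per thread; never serialized into closures
          current.set(new JavaSparkContext(new SparkConf().setAppName("worker-local")))
        }
        current.get()
      }
    }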

Re: JavaHiveContext class not found error. Help!!

2014-10-23 Thread Marcelo Vanzin
Hello there, This is more of a question for the cdh-users list, but in any case... In CDH 5.1 we skipped packaging of the Hive module in SparkSQL. That has been fixed in CDH 5.2, so if it's possible for you I'd recommend upgrading. On Thu, Oct 23, 2014 at 2:53 PM, nitinkak001

spark is running extremely slow with larger data set, like 2G

2014-10-23 Thread xuhongnever
My Spark version is 1.1.0, pre-built with Hadoop 1.x. My code is implemented in Python, trying to convert a graph data set from edge-list to adjacency-list format. Spark is running in standalone mode. It runs well with a small data set like soc-LiveJournal1, about 1G. Then I run it on the 25G Twitter graph, one

Re: spark is running extremely slow with larger data set, like 2G

2014-10-23 Thread xuhongnever
my code is here:

    from pyspark import SparkConf, SparkContext

    def Undirect(edge):
        vector = edge.strip().split('\t')
        if vector[0].isdigit():
            return [(vector[0], vector[1])]
        return []

    conf = SparkConf()
    conf.setMaster("spark://compute-0-14:7077")

Spark 1.1.0 and Hive 0.12.0 Compatibility Issue

2014-10-23 Thread Arthur . hk . chan
Hi, My Spark is 1.1.0 and Hive is 0.12. I tried to run the same query in both Hive-0.12.0 and Spark-1.1.0; HiveQL works while SparkSQL fails. hive> select l_orderkey, sum(l_extendedprice*(1-l_discount)) as revenue, o_orderdate, o_shippriority from customer c join orders o on c.c_mktsegment

Spark 1.1.0 and Hive 0.12.0 Compatibility Issue

2014-10-23 Thread arthur.hk.c...@gmail.com
(Please ignore if duplicated) Hi, My Spark is 1.1.0 and Hive is 0.12. I tried to run the same query in both Hive-0.12.0 and Spark-1.1.0; HiveQL works while SparkSQL fails. hive> select l_orderkey, sum(l_extendedprice*(1-l_discount)) as revenue, o_orderdate, o_shippriority from customer c

Re: unable to make a custom class as a key in a pairrdd

2014-10-23 Thread Niklas Wilcke
Hi Jao, I don't really know why this doesn't work, but I have two hints. You don't need to override hashCode and equals; the case modifier does that for you. Writing case class PersonID(id: String) would be enough to get the class you want, I think. If I change the type of the id param to Int
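A minimal sketch of the suggestion, in a compiled application (the sample data is illustrative):

    case class PersonID(id: String)

    val pairs = sc.parallelize(Seq(PersonID("a") -> 1, PersonID("a") -> 2, PersonID("b") -> 3))
    val totals = pairs.reduceByKey(_ + _)   // groups correctly: case classes compare by value
    // totals.collect() contains (PersonID("a"), 3) and (PersonID("b"), 3)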

Re: Exceptions not caught?

2014-10-23 Thread Ted Yu
Can you show the stack trace? Also, how do you catch exceptions? Did you specify TProtocolException? Cheers On Thu, Oct 23, 2014 at 3:40 PM, ankits ankitso...@gmail.com wrote: Hi, I'm running a spark job and encountering an exception related to thrift. I wanted to know where this is

Re: Exceptions not caught?

2014-10-23 Thread ankits
I am simply catching all exceptions (like case e: Throwable => println("caught: " + e)). Here is the stack trace: 2014-10-23 15:51:10,766 ERROR [] Exception in task 1.0 in stage 1.0 (TID 1) java.io.IOException: org.apache.thrift.protocol.TProtocolException: Required field 'X' is unset! Struct:Y(id:,

Re: Spark 1.1.0 and Hive 0.12.0 Compatibility Issue

2014-10-23 Thread Michael Armbrust
Can you show the DDL for the table? It looks like the SerDe might be saying it will produce a decimal type but is actually producing a string. On Thu, Oct 23, 2014 at 3:17 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi My Spark is 1.1.0 and Hive is 0.12, I tried to run the

Re: Exceptions not caught?

2014-10-23 Thread ankits
Also, everything is running locally on my box, driver and workers.

Re: Exceptions not caught?

2014-10-23 Thread Ted Yu
bq. Required field 'X' is unset! Struct:Y Can you check your class Y and fix the above ? Cheers On Thu, Oct 23, 2014 at 3:55 PM, ankits ankitso...@gmail.com wrote: I am simply catching all exceptions (like case e:Throwable = println(caught: +e) ) Here is the stack trace: 2014-10-23

Re: Exceptions not caught?

2014-10-23 Thread ankits
Can you check your class Y and fix the above? I can, but this is about catching the exception should it be thrown by any class in the spark job. Why is the exception not being caught?

Re: Exceptions not caught?

2014-10-23 Thread Marcelo Vanzin
On Thu, Oct 23, 2014 at 3:40 PM, ankits ankitso...@gmail.com wrote: 2014-10-23 15:39:50,845 ERROR [] Exception in task 1.0 in stage 1.0 (TID 1) java.io.IOException: org.apache.thrift.protocol.TProtocolException: This looks like an exception that's happening on an executor and just being
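A hedged sketch of where the catch would have to go, given what Marcelo points out (deserialize stands in for the thrift-decoding call; rdd is the poster's RDD):

    try {
      rdd.map(bytes => deserialize(bytes)).count()   // the TProtocolException is thrown inside the task, on an executor
    } catch {
      // the driver only sees the job failure, surfaced as a SparkException once retries are exhausted
      case e: org.apache.spark.SparkException => println("job failed: " + e.getMessage)
    }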

Re: About Memory usage in the Spark UI

2014-10-23 Thread Tathagata Das
The memory usage of blocks of data received through Spark Streaming is not reflected in the Spark UI. It only shows the memory usage due to cached RDDs. I didn't find a JIRA for this, so I opened a new one: https://issues.apache.org/jira/browse/SPARK-4072 TD On Thu, Oct 23, 2014 at 12:47 AM,

Spark using HDFS data [newb]

2014-10-23 Thread matan
Hi, I would like to verify or correct my understanding of Spark at the conceptual level. As I understand, ignoring streaming mode for a minute, Spark takes some input data (which can be an hdfs file), lets your code transform the data, and ultimately dispatches some computation over the data

Problem packing spark-assembly jar

2014-10-23 Thread Yana Kadiyska
Hi folks, I'm trying to deploy the latest from master branch and having some trouble with the assembly jar. In the spark-1.1 official distribution(I use cdh version), I see the following jars, where spark-assembly-1.1.0-hadoop2.0.0-mr1-cdh4.2.0.jar contains a ton of stuff:

RE: About Memory usage in the Spark UI

2014-10-23 Thread Haopu Wang
TD, thanks for the clarification. From the UI, it looks like the driver also allocates memory to store blocks; what's the purpose of that, since I think the driver doesn't need to run tasks? From: Tathagata Das [mailto:tathagata.das1...@gmail.com] Sent:

Re: unable to make a custom class as a key in a pairrdd

2014-10-23 Thread Prashant Sharma
Are you doing this in the REPL? Then there is a bug filed for this; I just can't recall the bug ID at the moment. Prashant Sharma On Fri, Oct 24, 2014 at 4:07 AM, Niklas Wilcke 1wil...@informatik.uni-hamburg.de wrote: Hi Jao, I don't really know why this doesn't work, but I have two hints.

Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

2014-10-23 Thread Svend
Hi all, (Waking up an old thread just for future reference) We've had a very similar issue just a couple of days ago: executing a spark driver on the same host where the mesos master runs succeeds, but executing it on our remote dev station hangs/fails after mesos reports the spark driver

Re: Spark 1.0.0 on yarn cluster problem

2014-10-23 Thread firemonk9
Hi, I am facing the same problem. My spark-env.sh has the below entries, yet I see the YARN container with only 1G and YARN only spawns two workers. SPARK_EXECUTOR_CORES=1 SPARK_EXECUTOR_MEMORY=3G SPARK_EXECUTOR_INSTANCES=5 Please let me know if you are able to resolve this issue. Thank you --

large benchmark sets for MLlib experiments

2014-10-23 Thread Chih-Jen Lin
Hi MLlib users, In August when I gave a talk at Databricks, Xiangrui mentioned the need of large public data for the development of MLlib. At this moment many use problems in the LIBSVM data sets for experiments. The file size of the larger ones (e.g., kddb) is about 20-30G. To fulfill the need, we

Memory requirement of using Spark

2014-10-23 Thread jian.t
Hello, I am new to Spark. I have a basic question about the memory requirements of using Spark. I need to perform joins between multiple data sets from multiple data sources. The join is not a straightforward join. The logic is more like: first join T1 on column A with T2, then for all the records that

Re: Spark 1.0.0 on yarn cluster problem

2014-10-23 Thread Andrew Or
Did you `export` the environment variables? Also, are you running in client mode or cluster mode? If it still doesn't work, you can try to set these through the spark-submit command-line options --num-executors, --executor-cores, and --executor-memory. 2014-10-23 19:25 GMT-07:00 firemonk9
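A sketch of the command-line equivalent (master, class, and jar names are illustrative):

    ./bin/spark-submit --master yarn-cluster \
      --num-executors 5 --executor-cores 1 --executor-memory 3G \
      --class com.example.Main your-app.jar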

Re: Multitenancy in Spark - within/across spark context

2014-10-23 Thread Evan Chan
Ashwin, I would say the strategies in general are: 1) Have each user submit separate Spark app (each its own Spark Context), with its own resource settings, and share data through HDFS or something like Tachyon for speed. 2) Share a single spark context amongst multiple users, using fair