Re: OOM for HiveFromSpark example

2015-03-26 Thread Akhil Das
Not sure, but you can create that path in all workers and put that file in it. Thanks Best Regards On Thu, Mar 26, 2015 at 1:56 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: The Hive command LOAD DATA LOCAL INPATH

Column not found in schema when querying partitioned table

2015-03-26 Thread Jon Chase
Spark 1.3.0, Parquet I'm having trouble referencing partition columns in my queries. In the following example, 'probeTypeId' is a partition column. For example, the directory structure looks like this: /mydata /probeTypeId=1 ...files... /probeTypeId=2 ...files... I see

Re: OOM for HiveFromSpark example

2015-03-26 Thread Akhil Das
Could you try putting that file in hdfs and try like: LOAD DATA INPATH 'hdfs://sigmoid/test/kv1.txt' INTO TABLE src_spark Thanks Best Regards On Thu, Mar 26, 2015 at 2:07 PM, Akhil Das ak...@sigmoidanalytics.com wrote: When you run it in local mode ^^ Thanks Best Regards On Thu, Mar 26,
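
For reference, a minimal sketch of that suggestion; the HDFS path is the illustrative one from the message, and the table layout follows the HiveFromSpark example (both are assumptions, not verified against the original job):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("HiveFromSpark"))
    val hiveContext = new HiveContext(sc)

    // Table matching the HiveFromSpark example
    hiveContext.sql("CREATE TABLE IF NOT EXISTS src_spark (key INT, value STRING)")

    // LOAD DATA INPATH (without LOCAL) reads from HDFS, so the file is visible cluster-wide
    hiveContext.sql("LOAD DATA INPATH 'hdfs://sigmoid/test/kv1.txt' INTO TABLE src_spark")
    hiveContext.sql("SELECT COUNT(*) FROM src_spark").collect().foreach(println)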

Missing an output location for shuffle. : (

2015-03-26 Thread 李铖
Again, when I run a Spark SQL query over a larger file, an error occurs. Has anyone got a fix for it? Please help me. Here is the stack trace. org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 at

Handling Big data for interactive BI tools

2015-03-26 Thread kundan kumar
Hi, I need to store terabytes of data which will be used for BI tools like qlikview. The queries can be on the basis of a filter on any column. Currently, we are using Redshift for this purpose. I am trying to explore things other than Redshift. Is it possible to gain better performance in

Re: OOM for HiveFromSpark example

2015-03-26 Thread ๏̯͡๏
I don't think that's correct. LOAD DATA LOCAL should pick input from a local directory. On Thu, Mar 26, 2015 at 1:59 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Not sure, but you can create that path in all workers and put that file in it. Thanks Best Regards On Thu, Mar 26, 2015 at 1:56

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-26 Thread Steve Loughran
On 25 Mar 2015, at 21:54, roni roni.epi...@gmail.com wrote: Is there any way that I can install the new one and remove the previous version? I installed spark 1.3 on my EC2 master and set the spark home to the new one. But when I start the spark-shell I get -

Write Parquet File with spark-streaming with Spark 1.3

2015-03-26 Thread Richard Grossman
Hi, I've succeeded in writing a Kafka stream to a Parquet file in Spark 1.2, but I can't make it work with Spark 1.3. In streaming I can't use saveAsParquetFile() because I can't add data to an existing Parquet file. I know that it's possible to stream data directly into Parquet; could you help me by providing

Re: OOM for HiveFromSpark example

2015-03-26 Thread ๏̯͡๏
The Hive command LOAD DATA LOCAL INPATH '/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/examples/src/main/resources/kv1.txt' INTO TABLE src_spark 1. LOCAL INPATH: if I push the file to HDFS, then how will it work? 2. I can't use sc.addFile, because I want to run Hive (Spark SQL) queries. On Thu, Mar

Re: OOM for HiveFromSpark example

2015-03-26 Thread Akhil Das
When you run it in local mode ^^ Thanks Best Regards On Thu, Mar 26, 2015 at 2:06 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I don't think that's correct. LOAD DATA LOCAL should pick input from a local directory. On Thu, Mar 26, 2015 at 1:59 PM, Akhil Das ak...@sigmoidanalytics.com wrote:

Re: Registering custom UDAFs with HiveContext in SparkSQL, how?

2015-03-26 Thread ๏̯͡๏
Hello Shahab, Are you able to read tables created in Hive from Spark SQL ? If yes, how are you referring them ? On Thu, Mar 26, 2015 at 1:11 PM, Takeshi Yamamuro linguin@gmail.com wrote: I think it is not `sqlContext` but hiveContext because `create temporary function` is not supported in

Re: Column not found in schema when querying partitioned table

2015-03-26 Thread Jon Chase
I've filed this as https://issues.apache.org/jira/browse/SPARK-6554 On Thu, Mar 26, 2015 at 6:29 AM, Jon Chase jon.ch...@gmail.com wrote: Spark 1.3.0, Parquet I'm having trouble referencing partition columns in my queries. In the following example, 'probeTypeId' is a partition column. For

Re: Hive Table not found from Spark SQL

2015-03-26 Thread ๏̯͡๏
I have a table dw_bid that is created in Hive and has nothing to do with Spark. I have data in Avro that I want to join with the dw_bid table, and this join needs to be done using Spark SQL. However, for some reason Spark says the dw_bid table does not exist. How do I tell Spark that dw_bid is a table created

Re: How to deploy binary dependencies to workers?

2015-03-26 Thread Xi Shen
OK, after various testing, I found the native library can be loaded if running in yarn-cluster mode. But I still cannot find out why it won't load when running in yarn-client mode... Thanks, David On Thu, Mar 26, 2015 at 4:21 PM Xi Shen davidshe...@gmail.com wrote: Not of course...all

Re: Write Parquet File with spark-streaming with Spark 1.3

2015-03-26 Thread Cheng Lian
You may resort to the generic save API introduced in 1.3, which supports appending as long as the target data source supports it. And in 1.3, Parquet does support appending. Cheng On 3/26/15 4:13 PM, Richard Grossman wrote: Hi I've succeed to write kafka stream to parquet file in Spark 1.2
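
For illustration, a rough sketch of appending each micro-batch to one Parquet directory through the 1.3 generic save API; the Event case class, the DStream parameter and the output path are assumptions, not taken from the thread:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.{SQLContext, SaveMode}
    import org.apache.spark.streaming.dstream.DStream

    case class Event(key: String, value: String)

    // Append every micro-batch of an (assumed) DStream[Event] to a single Parquet directory.
    def writeAsParquet(stream: DStream[Event], sc: SparkContext, path: String): Unit = {
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._
      stream.foreachRDD { rdd =>
        if (!rdd.isEmpty()) {
          // Generic save API with SaveMode.Append adds new files to the existing Parquet data
          rdd.toDF().save(path, "parquet", SaveMode.Append)
        }
      }
    }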

Re: Windowing and Analytics Functions in Spark SQL

2015-03-26 Thread Masf
Ok, Thanks. Some web resource where I could check the functionality supported by Spark SQL? Thanks!!! Regards. Miguel Ángel. On Thu, Mar 26, 2015 at 12:31 PM, Cheng Lian lian.cs@gmail.com wrote: We're working together with AsiaInfo on this. Possibly will deliver an initial version of

Re: Windowing and Analytics Functions in Spark SQL

2015-03-26 Thread Arush Kharbanda
You can look at the Spark SQL programming guide. http://spark.apache.org/docs/1.3.0/sql-programming-guide.html and the Spark API. http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.package On Thu, Mar 26, 2015 at 5:21 PM, Masf masfwo...@gmail.com wrote: Ok, Thanks. Some

Why does an executor encounter OutOfMemoryException: Java heap space

2015-03-26 Thread sergunok
Hi all, sometimes you can see OutOfMemoryException: Java heap space from an executor in Spark. There are many ideas about workarounds. My question is: how does an executor execute tasks from the point of view of memory usage and parallelism? The picture in my mind is: an executor is a JVM instance. Number

Re: Hive Table not found from Spark SQL

2015-03-26 Thread ๏̯͡๏
I have this query insert overwrite table sojsuccessevents2_spark select guid,sessionKey,sessionStartDate,sojDataDate,seqNum,eventTimestamp,siteId,successEventType,sourceType,itemId, shopCartId,b.transaction_Id as transactionId,offerId,b.bdr_id as

Re: Windowing and Analytics Functions in Spark SQL

2015-03-26 Thread Arush Kharbanda
Its not yet implemented. https://issues.apache.org/jira/browse/SPARK-1442 On Thu, Mar 26, 2015 at 4:39 PM, Masf masfwo...@gmail.com wrote: Hi. Are the Windowing and Analytics functions supported in Spark SQL (with HiveContext or not)? For example in Hive is supported

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-26 Thread Cheng Lian
I couldn't reproduce this with the following spark-shell snippet: scala> import sqlContext.implicits._ scala> Seq((1, 2)).toDF("a", "b") scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite) scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite) The _common_metadata file is

Port configuration for BlockManagerId

2015-03-26 Thread Manish Gupta 8
Hi, I am running spark-shell and connecting to a YARN cluster with deploy mode as client. In our environment, there are some security policies that don't allow us to open all TCP ports. The issue I am facing is: the Spark Shell driver is using a random port for BlockManagerId -

Windowing and Analytics Functions in Spark SQL

2015-03-26 Thread Masf
Hi. Are the Windowing and Analytics functions supported in Spark SQL (with HiveContext or not)? For example in Hive is supported https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics Some tutorial or documentation where I can see all features supported by Spark

Re: Handling Big data for interactive BI tools

2015-03-26 Thread Akhil Das
Yes, you can easily configure Spark Thrift server and connect BI Tools. Here's an example https://hadoopi.wordpress.com/2014/12/31/spark-connect-tableau-desktop-to-sparksql/ showing how to integrate SparkSQL with Tableau dashboards. Thanks Best Regards On Thu, Mar 26, 2015 at 3:56 PM, kundan

Why does k-means clustering hang for a long time?

2015-03-26 Thread Xi Shen
Hi, When I run k-means cluster with Spark, I got this in the last two lines in the log: 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5 Then it hangs for a long time. There's no active job. The driver machine is

Re: Windowing and Analytics Functions in Spark SQL

2015-03-26 Thread Cheng Lian
We're working together with AsiaInfo on this. Possibly will deliver an initial version of window function support in 1.4.0. But it's not a promise yet. Cheng On 3/26/15 7:27 PM, Arush Kharbanda wrote: Its not yet implemented. https://issues.apache.org/jira/browse/SPARK-1442 On Thu, Mar 26,

Re: Handling Big data for interactive BI tools

2015-03-26 Thread Jörn Franke
You can also preaggregate results for the queries by the user - depending on what queries they use, this might be necessary for any underlying technology. On 26 Mar 2015 11:27, kundan kumar iitr.kun...@gmail.com wrote: Hi, I need to store terabytes of data which will be used for BI tools

Populating a HashMap from a GraphX connectedComponents graph

2015-03-26 Thread Bob DuCharme
The Scala code below was based on https://www.sics.se/~amir/files/download/dic/answers6.pdf. I extended it by adding a HashMap called componentLists that I populated with each component's starting node as the key and then a ListBuffer of the component's members. As the output below the code

Re: Spark-1.3.0 UI shows 0 cores in completed applications tab

2015-03-26 Thread Sean Owen
https://issues.apache.org/jira/browse/SPARK-5771 ? On Thu, Mar 26, 2015 at 12:58 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi all, I started spark-shell in spark-1.3.0 and did some actions. The UI was showing 8 cores under the running applications tab. But when I exited the spark-shell

Spark-1.3.0 UI shows 0 cores in completed applications tab

2015-03-26 Thread MEETHU MATHEW
Hi all, I started spark-shell in spark-1.3.0 and did some actions. The UI was showing 8 cores under the running applications tab. But when I exited the spark-shell using exit, the application is moved to completed applications tab and the number of cores is 0. Again when I exited the

Re: Spark-core and guava

2015-03-26 Thread Sean Owen
This is a long and complicated story. In short, Spark shades Guava 14 except for a few classes that were accidentally used in a public API (Optional and a few more it depends on). So provided is more of a Maven workaround to achieve a desired effect. It's not provided in the usual sense. On Thu,

Which RDD operations preserve ordering?

2015-03-26 Thread sergunok
Hi guys, I don't have an exact picture of how the ordering of RDD elements is preserved after executing operations. Which operations preserve it? 1) map (Yes?) 2) zipWithIndex (Yes, or sometimes yes?) Serg. -- View this message in context:

Spark-core and guava

2015-03-26 Thread Stevo Slavić
Hello Apache Spark community, spark-core 1.3.0 has guava 14.0.1 as provided dependency (see http://repo1.maven.org/maven2/org/apache/spark/spark-core_2.10/1.3.0/spark-core_2.10-1.3.0.pom ) What is supposed to provide guava, and that specific version? Kind regards, Stevo Slavic.

Re: Handling Big data for interactive BI tools

2015-03-26 Thread kundan kumar
I was looking for some options and came across http://www.jethrodata.com/ On Thu, Mar 26, 2015 at 5:47 PM, Jörn Franke jornfra...@gmail.com wrote: You can also preaggregate results for the queries by the user - depending on what queries they use this might be necessary for any underlying

Re: Handling Big data for interactive BI tools

2015-03-26 Thread kundan kumar
I was looking for some options and came across JethroData. http://www.jethrodata.com/ This stores the data maintaining indexes over all the columns; it seems good and claims to have better performance than Impala. Earlier I had tried Apache Phoenix because of its secondary indexing feature. But the

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Debasish Das
You can do it in-memory as well... get 10% topK elements from each partition and use merge from any sort algorithm like timsort... basically aggregateBy. Your version uses shuffle but this version is 0 shuffle... assuming your data set is cached you will be using in-memory allReduce through

Re: spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread Cheng Lian
Hey Deepak, It seems that your hive-site.xml says your Hive metastore setup is using MySQL. If that's not the case, you need to adjust your hive-site.xml configurations. As for the version of MySQL driver, it should match the MySQL server. Cheng On 3/27/15 11:07 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: I

spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread ๏̯͡๏
I am unable to run spark-sql from the command line. I attempted the following 1) export SPARK_HOME=/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4 export SPARK_JAR=$SPARK_HOME/lib/spark-assembly-1.3.0-hadoop2.4.0.jar export

SparkContext.wholeTextFiles throws not serializable exception

2015-03-26 Thread Xi Shen
Hi, I want to load my data in this way: sc.wholeTextFiles(opt.input) map { x => (x._1, x._2.lines.filter(!_.isEmpty).toSeq) } But I got java.io.NotSerializableException: scala.collection.Iterator$$anon$13 But if I use x._2.split('\n'), I can get the expected result. I want to know what's

Re: What is best way to run spark job in yarn-cluster mode from java program(servlet container) and NOT using spark-submit command.

2015-03-26 Thread Noorul Islam K M
Sandy Ryza sandy.r...@cloudera.com writes: Creating a SparkContext and setting master as yarn-cluster unfortunately will not work. SPARK-4924 added APIs for doing this in Spark, but won't be included until 1.4. -Sandy Did you look into something like [1]? With that you can make rest API

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Debasish Das
Idea is to use a heap and get topK elements from every partition...then use aggregateBy and for combOp do a merge routine from mergeSort...basically get 100 items from partition 1, 100 items from partition 2, merge them so that you get sorted 200 items and take 100...for merge you can use heap as
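
A rough sketch of that idea under stated assumptions: an RDD of (segment, score) pairs as in the original question, plain sorted lists standing in for the heap, and k as an illustrative parameter.

    import org.apache.spark.rdd.RDD

    // Keep the k highest scores per segment: top-k within each partition, then merge the partials.
    def topKPerSegment(scores: RDD[(String, Double)], k: Int): RDD[(String, List[Double])] = {
      val desc = Ordering[Double].reverse
      scores.aggregateByKey(List.empty[Double])(
        (acc, s) => (s :: acc).sorted(desc).take(k),  // seqOp: per-partition top-k
        (a, b)   => (a ++ b).sorted(desc).take(k))    // combOp: merge two top-k lists
    }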

Spark SQL configurations

2015-03-26 Thread ๏̯͡๏
Hello, Can someone share with me the list of commands (including export statements) that you use to run Spark SQL over a YARN cluster. I am unable to get it running on my YARN cluster and am running into exceptions. I understand I need to share the specific exception. This is more like I want to know if I have

k-means can only run on one executor with one thread?

2015-03-26 Thread Xi Shen
Hi, I have a large data set, and I expect to get 5000 clusters. I load the raw data, convert them into DenseVector, then I do repartition and cache; finally I give the RDD[Vector] to KMeans.train(). Now the job is running, and data are loaded. But according to the Spark UI, all data are
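
For context, a minimal sketch of the pipeline being described; the input path, the parsing and the iteration count are illustrative, not taken from the thread, and sc is assumed to be the shell's SparkContext:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val points = sc.textFile("hdfs:///data/points.csv")
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .repartition(sc.defaultParallelism)  // spread the partitions before caching
      .cache()

    val model = KMeans.train(points, 5000, 20)  // k = 5000 clusters, 20 iterations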

Re: Cross-compatibility of YARN shuffle service

2015-03-26 Thread Sandy Ryza
Hi Matt, I'm not sure whether we have documented compatibility guidelines here. However, a strong goal is to keep the external shuffle service compatible so that many versions of Spark can run against the same shuffle service. -Sandy On Wed, Mar 25, 2015 at 6:44 PM, Matt Cheah

Re: spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread ๏̯͡๏
I do not use MySQL. I want to read Hive tables from Spark SQL and transform them in Spark SQL. Why do I need a MySQL driver? If I still need it, which version should I use? Assuming I need it, I downloaded the latest version of it from

Re: spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread Denny Lee
If you're not using MySQL as your metastore for Hive, out of curiosity what are you using? The error you are seeing is common when the correct driver to let Spark connect to the Hive metastore isn't on the classpath. As well, I noticed that you're using

Re: spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread Cheng Lian
As the exception suggests, you don't have the MySQL JDBC driver on your classpath. On 3/27/15 10:45 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: I am unable to run spark-sql from the command line. I attempted the following 1) export SPARK_HOME=/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4 export

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-26 Thread Pei-Lun Lee
Hi Cheng, on my computer, execute res0.save(xxx, org.apache.spark.sql.SaveMode. Overwrite) produces: peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx total 32 -rwxrwxrwx 1 peilunlee staff0 Mar 27 11:29 _SUCCESS* -rwxrwxrwx 1 peilunlee staff 272 Mar 27 11:29

Re: SparkContext.wholeTextFiles throws not serializable exception

2015-03-26 Thread Xi Shen
I have to use .lines.toArray.toSeq A little tricky. Xi Shen http://about.me/davidshen On Fri, Mar 27, 2015 at 4:41 PM, Xi Shen davidshe...@gmail.com wrote: Hi, I want to load my data in this way:
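
A minimal sketch of that workaround (the literal path is a stand-in for the original opt.input): materializing the line iterator with toArray keeps a serializable collection in the pair instead of a lazy Iterator.

    val docs = sc.wholeTextFiles("hdfs:///data/docs").map { case (name, content) =>
      // content.lines is an Iterator[String]; toArray materializes it before it is captured
      (name, content.lines.filter(_.nonEmpty).toArray.toSeq)
    }
    docs.count()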

Re: Combining Many RDDs

2015-03-26 Thread Noorul Islam K M
Yang Chen y...@yang-cs.com writes: Hi Noorul, Thank you for your suggestion. I tried that, but ran out of memory. I did some search and found some suggestions that we should try to avoid rdd.union(

How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Aung Htet
Hi all, I have a distribution represented as an RDD of tuples, in rows of (segment, score) For each segment, I want to discard tuples with top X percent scores. This seems hard to do in Spark RDD. A naive algorithm would be - 1) Sort RDD by segment score (descending) 2) Within each segment,

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Aung Htet
Hi Debasish, Thanks for your suggestions. In-memory version is quite useful. I do not quite understand how you can use aggregateBy to get 10% top K elements. Can you please give an example? Thanks, Aung On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das debasish.da...@gmail.com wrote: You can do

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Charles Hayden
You could also consider using a count-min data structure such as in https://github.com/laserson/dsq to get approximate quantiles, then use whatever values you want to filter the original sequence. From: Debasish Das debasish.da...@gmail.com Sent: Thursday,

DataFrame GroupBy

2015-03-26 Thread gtanguy
Hello everybody, I am trying to do a simple groupBy : *Code:* val df = hiveContext.sql(SELECT * FROM table1) df .printSchema() df .groupBy(customer_id).count().show(5) *Stacktrace* : root |-- customer_id: string (nullable = true) |-- rank: string (nullable = true) |-- reco_material_id:

Re: HQL function Rollup and Cube

2015-03-26 Thread Chang Lim
Clarification on how the HQL was invoked: hiveContext.sql(select a, b, count(*) from t group by a, b with rollup) Thanks, Chang -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/HQL-function-Rollup-and-Cube-tp22241p22244.html Sent from the Apache Spark
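
A small sketch of that call, assuming the table t is visible to a HiveContext (ROLLUP/CUBE are HiveQL features, so a plain SQLContext will not parse them) and sc is the shell's SparkContext:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    val rolledUp = hiveContext.sql("SELECT a, b, COUNT(*) FROM t GROUP BY a, b WITH ROLLUP")
    rolledUp.show()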

Re: Combining Many RDDs

2015-03-26 Thread Noorul Islam K M
sparkx y...@yang-cs.com writes: Hi, I have a Spark job and a dataset of 0.5 Million items. Each item performs some sort of computation (joining a shared external dataset, if that does matter) and produces an RDD containing 20-500 result items. Now I would like to combine all these RDDs and

Re: Handling Big data for interactive BI tools

2015-03-26 Thread Denny Lee
BTW, a tool that I have been using to help do the preaggregation of data using hyperloglog in combination with Spark is atscale (http://atscale.com/). It builds the aggregations and makes use of the speed of SparkSQL - all within the context of a model that is accessible by Tableau or Qlik. On

Re: Spark-core and guava

2015-03-26 Thread Stevo Slavić
Thanks for heads up Sean! On Mar 26, 2015 1:30 PM, Sean Owen so...@cloudera.com wrote: This is a long and complicated story. In short, Spark shades Guava 14 except for a few classes that were accidentally used in a public API (Optional and a few more it depends on). So provided is more of a

Re: How to get rdd count() without double evaluation of the RDD?

2015-03-26 Thread Mark Hamstra
You can also always take the more extreme approach of using SparkContext#runJob (or submitJob) to write a custom Action that does what you want in one pass. Usually that's not worth the extra effort. On Thu, Mar 26, 2015 at 9:27 AM, Sean Owen so...@cloudera.com wrote: To avoid computing twice

How to get rdd count() without double evaluation of the RDD?

2015-03-26 Thread Wang, Ningjun (LNG-NPV)
I have an RDD that is expensive to compute. I want to save it as an object file and also print the count. How can I avoid double computation of the RDD? val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line)) val count = rdd.count() // this forces computation of the rdd

Re: python : Out of memory: Kill process

2015-03-26 Thread Eduardo Cusa
The last try was without log2.cache() and I am still getting out of memory. I am using the following conf, maybe it helps: conf = (SparkConf() .setAppName(LoadS3) .set(spark.executor.memory, 13g) .set(spark.driver.memory, 13g) .set(spark.driver.maxResultSize,2g)

Re: How to get rdd count() without double evaluation of the RDD?

2015-03-26 Thread Sean Owen
To avoid computing twice you need to persist the RDD but that need not be in memory. You can persist to disk with persist(). On Mar 26, 2015 4:11 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: I have a rdd that is expensive to compute. I want to save it as object file and also
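
A minimal sketch of that suggestion; someFile and expensiveCalculation mirror the question above, with stand-in definitions and an illustrative output path so the snippet is self-contained:

    import org.apache.spark.storage.StorageLevel

    val someFile = "hdfs:///data/input.txt"                         // stand-in path
    def expensiveCalculation(line: String): String = line.reverse   // stand-in for the real work

    val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))
    rdd.persist(StorageLevel.DISK_ONLY)            // spill to disk instead of keeping it in memory

    val count = rdd.count()                        // first action computes and persists the RDD
    rdd.saveAsObjectFile("hdfs:///data/output")    // second action reuses the persisted partitions
    rdd.unpersist()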

Combining Many RDDs

2015-03-26 Thread sparkx
Hi, I have a Spark job and a dataset of 0.5 Million items. Each item performs some sort of computation (joining a shared external dataset, if that does matter) and produces an RDD containing 20-500 result items. Now I would like to combine all these RDDs and perform a next job. What I have found

HQL function Rollup and Cube

2015-03-26 Thread Chang Lim
Has anyone been able to use Hive 0.13 ROLLUP and CUBE functions in Spark 1.3's Hive Context? According to https://issues.apache.org/jira/browse/SPARK-2663, this has been resolved in Spark 1.3. I created an in-memory temp table (t) and tried to execute a ROLLUP(and CUBE) function: select a,

Re: What his the ideal method to interact with Spark Cluster from a Cloud App?

2015-03-26 Thread Noorul Islam K M
Today I found one answer in this thread [1] which seems to be worth exploring. Michael, if you are reading this, it will be helpful if you could share more about your spark deployment in production. Thanks and Regards Noorul [1]

Re: python : Out of memory: Kill process

2015-03-26 Thread Eduardo Cusa
I am running on EC2: 1 Master : 4 CPU 15 GB RAM (2 GB swap) 2 Slaves 4 CPU 15 GB RAM the uncompressed dataset size is 15 GB On Thu, Mar 26, 2015 at 10:41 AM, Eduardo Cusa eduardo.c...@usmediaconsulting.com wrote: Hi Davies, I upgraded to 1.3.0 and am still getting Out of Memory. I ran the

Re: python : Out of memory: Kill process

2015-03-26 Thread Davies Liu
Could you try to remove the line `log2.cache()` ? On Thu, Mar 26, 2015 at 10:02 AM, Eduardo Cusa eduardo.c...@usmediaconsulting.com wrote: I running on ec2 : 1 Master : 4 CPU 15 GB RAM (2 GB swap) 2 Slaves 4 CPU 15 GB RAM the uncompressed dataset size is 15 GB On Thu, Mar 26, 2015

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread Nick Pentreath
I'm guessing the Accumulo Key and Value classes are not serializable, so you would need to do something like val rdd = sc.newAPIHadoopRDD(...).map { case (key, value) => (extractScalaType(key), extractScalaType(value)) } Where 'extractScalaType' converts the Key or Value to a standard

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread David Holiday
Progress! I was able to figure out why the 'input INFO not set' error was occurring. The eagle-eyed among you will no doubt see the following code is missing a closing ')': AbstractInputFormat.setConnectorInfo(jobConf, root, new PasswordToken(password) As I'm doing this in spark-notebook,

Re: HQL function Rollup and Cube

2015-03-26 Thread Chang Lim
Solved. In the IDE, the project settings were missing the dependent lib jars (jar files under spark-xx/lib). When these jars are not set, I got a class-not-found error about datanucleus classes (compared to an out-of-memory error in Spark Shell). In the context of Spark Shell, these dependent jars need to

Re: Parallel actions from driver

2015-03-26 Thread Sean Owen
You can do this much more simply, I think, with Scala's parallel collections (try .par). There's nothing wrong with doing this, no. Here, something is getting caught in your closure, maybe unintentionally, that's not serializable. It's not directly related to the parallelism. On Thu, Mar 26,
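
For illustration, a minimal sketch of the .par approach (the two input paths are made up, and sc is the shell's SparkContext): the Spark scheduler is thread-safe, so actions submitted from a parallel collection run as concurrent jobs.

    val rdds = Seq(
      sc.textFile("hdfs:///data/a"),
      sc.textFile("hdfs:///data/b"))

    // .par turns the Seq into a parallel collection, so both count() jobs are submitted concurrently
    val counts = rdds.par.map(_.count()).seq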

EsHadoopSerializationException: java.net.SocketTimeoutException: Read timed out

2015-03-26 Thread Adrian Mocanu
Hi, I need help fixing a timeout exception thrown from ElasticSearch Hadoop. The ES cluster is up all the time. I use ElasticSearch Hadoop to read data from ES into RDDs. I get a collection of these RDDs which I traverse (with foreachRDD) and create more RDDs from each RDD in the collection.

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread Russ Weeks
Hi, David, This is the code that I use to create a JavaPairRDD from an Accumulo table: JavaSparkContext sc = new JavaSparkContext(conf); Job hadoopJob = Job.getInstance(conf,TestSparkJob); job.setInputFormatClass(AccumuloInputFormat.class); AccumuloInputFormat.setZooKeeperInstance(job,

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread Corey Nolet
Spark uses a SerializableWritable [1] to java serialize writable objects. I've noticed (at least in Spark 1.2.1) that it breaks down with some objects when Kryo is used instead of regular java serialization. Though it is wrapping the actual AccumuloInputFormat (another example of something you

Re: python : Out of memory: Kill process

2015-03-26 Thread Davies Liu
Could you narrow down to a step which cause the OOM, something like: log2= self.sqlContext.jsonFile(path) log2.count() ... out.count() ... On Thu, Mar 26, 2015 at 10:34 AM, Eduardo Cusa eduardo.c...@usmediaconsulting.com wrote: the last try was without log2.cache() and still getting out of

Re: RDD to DataFrame for using ALS under org.apache.spark.ml.recommendation.ALS

2015-03-26 Thread Chang Lim
After this line: val sc = new SparkContext(conf) You need to add this line: import sc.implicits._ //this is used to implicitly convert an RDD to a DataFrame. Hope this helps -- View this message in context:

FakeClassTag in Java API

2015-03-26 Thread kmader
The JavaAPI uses FakeClassTag for all of the implicit class tags fed to RDDs during creation, mapping, etc. I am working on a more generic Scala library where I won't always have the type information beforehand. Is it safe / accepted practice to use FakeClassTag in these situations as well? It was

Re: Spark History Server : jobs link doesn't open

2015-03-26 Thread , Roy
in log I found this 2015-03-26 19:42:09,531 WARN org.eclipse.jetty.servlet.ServletHandler: Error for /history/application_1425934191900_87572 org.spark-project.guava.common.util.concurrent.ExecutionError: java.lang.OutOfMemoryError: GC overhead limit exceeded at

Re: Why does k-means clustering hang for a long time?

2015-03-26 Thread Xi Shen
Hi Burak, After I added .repartition(sc.defaultParallelism), I can see from the log the partition number is set to 32. But in the Spark UI, it seems all the data are loaded onto one executor. Previously they were loaded onto 4 executors. Any idea? Thanks, David On Fri, Mar 27, 2015 at 11:01

Re: WordCount example

2015-03-26 Thread Mohit Anchlia
What's the best way to troubleshoot inside spark to see why Spark is not connecting to nc on port ? I don't see any errors either. On Thu, Mar 26, 2015 at 2:38 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying to run the word count example but for some reason it's not working as

Re: python : Out of memory: Kill process

2015-03-26 Thread Eduardo Cusa
Hi Davies, I upgraded to 1.3.0 and am still getting Out of Memory. I ran the same code as before; do I need to make any changes? On Wed, Mar 25, 2015 at 4:00 PM, Davies Liu dav...@databricks.com wrote: With batchSize = 1, I think it will become even worse. I'd suggest to go with 1.3, have a

RDD equivalent of HBase Scan

2015-03-26 Thread Stuart Layton
HBase scans come with the ability to specify filters that make scans very fast and efficient (as they let you seek for the keys that pass the filter). Do RDD's or Spark DataFrames offer anything similar or would I be required to use a NoSQL db like HBase to do something like this? -- Stuart

Re: RDD equivalent of HBase Scan

2015-03-26 Thread Ted Yu
In examples//src/main/scala/org/apache/spark/examples/HBaseTest.scala, TableInputFormat is used. TableInputFormat accepts parameter public static final String SCAN = hbase.mapreduce.scan; where if specified, Scan object would be created from String form: if (conf.get(SCAN) != null) {

Spark log shows only this line repeated: RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time X

2015-03-26 Thread Adrian Mocanu
Here's my log output from a streaming job. What is this? 09:54:27.504 [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time 1427378067504 09:54:27.505 [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer -

Re: RDD equivalent of HBase Scan

2015-03-26 Thread Stuart Layton
Thanks, but I'm hoping to get away from HBase altogether. I was wondering if there is a way to get similar scan performance directly on cached RDDs or data frames On Thu, Mar 26, 2015 at 9:54 AM, Ted Yu yuzhih...@gmail.com wrote: In

Re: Which RDD operations preserve ordering?

2015-03-26 Thread Ted Yu
This is related: https://issues.apache.org/jira/browse/SPARK-6340 On Thu, Mar 26, 2015 at 5:58 AM, sergunok ser...@gmail.com wrote: Hi guys, I don't have exact picture about preserving of ordering of elements of RDD after executing of operations. Which operations preserve it? 1) Map

RDD.map does not allow preservesPartitioning?

2015-03-26 Thread Zhan Zhang
Hi Folks, Does anybody know the reason for not allowing preservesPartitioning in RDD.map? Do I miss something here? The following example involves two shuffles. I think if preservesPartitioning were allowed, we could avoid the second one, right? val r1 =

Re: RDD.map does not allow preservesPartitioning?

2015-03-26 Thread Jonathan Coveney
I believe if you do the following: sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4)).map((_,1)).reduceByKey(_+_).mapValues(_+1).reduceByKey(_+_).toDebugString (8) MapPartitionsRDD[34] at reduceByKey at console:23 [] | MapPartitionsRDD[33] at mapValues at console:23 [] | ShuffledRDD[32] at

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread andy petrella
That purely awesome! Don't hesitate to contribute your notebook back to the spark notebook repo, even rough, I'll help cleaning up if needed. The vagrant is also appealing  Congrats! Le jeu 26 mars 2015 22:22, David Holiday dav...@annaisystems.com a écrit : w0t! that did it!

RE: EsHadoopSerializationException: java.net.SocketTimeoutException: Read timed out

2015-03-26 Thread Adrian Mocanu
I also get stack overflow every now and then without having any recursive calls: java.lang.StackOverflowError: null at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1479) ~[na:1.7.0_75] at

Re: RDD.map does not allow preservesPartitioning?

2015-03-26 Thread Zhan Zhang
Thanks Jonathan. You are right regarding rewriting the example. I mean providing such an option to the developer so that it is controllable. The example may seem silly, and I don't know the use cases. But for example, if I also want to operate on both the key and value part to generate some new value

Re: Combining Many RDDs

2015-03-26 Thread Kelvin Chu
Hi, I used union() before and yes it may be slow sometimes. I _guess_ your variable 'data' is a Scala collection and compute() returns an RDD. Right? If yes, I tried the approach below to operate on one RDD only during the whole computation (Yes, I also saw that too many RDD hurt performance).
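
One commonly suggested pattern (a sketch under assumptions; the thread's own workaround is cut off above): build the small RDDs first and combine them with a single SparkContext.union call instead of chaining rdd.union, which keeps the union flat. items and compute are stand-ins for the 0.5 million inputs and the per-item job.

    import org.apache.spark.rdd.RDD

    val items = (1 to 1000).toSeq                   // stand-in for the item collection
    def compute(item: Int): RDD[String] =           // stand-in for the per-item computation
      sc.parallelize(Seq(s"result-of-$item"))

    val perItem: Seq[RDD[String]] = items.map(compute)
    val combined: RDD[String] = sc.union(perItem)   // one flat union over the whole sequence
    combined.count()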

Can't access file in spark, but can in hadoop

2015-03-26 Thread Dale Johnson
There seems to be a special kind of file in HDFS that is corrupted according to Spark. I have isolated a set of files (maybe 1% of all files I need to work with) which produce the following stack dump when I try to open them with sc.textFile(). When I try to open directories, most large

Re: RDD.map does not allow preservesPartitioning?

2015-03-26 Thread Jonathan Coveney
This is just a deficiency of the API, imo. I agree: mapValues could definitely be a function (K, V) => V1. The option isn't set by the function, it's on the RDD. So you could look at the code and do this. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala

Re: RDD.map does not allow preservesPartitioning?

2015-03-26 Thread Patrick Wendell
I think we have a version of mapPartitions that allows you to tell Spark the partitioning is preserved: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L639 We could also add a map function that does same. Or you can just write your map using an
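
A small sketch of that pointer (names and numbers are illustrative): a key-preserving transformation written with mapPartitions and preservesPartitioning = true keeps the existing partitioner, so the second reduceByKey needs no shuffle.

    val r1 = sc.parallelize(1 to 100).map(x => (x % 10, x)).reduceByKey(_ + _)

    val r2 = r1.mapPartitions(
      iter => iter.map { case (k, v) => (k, v + 1) },  // keys are left untouched
      preservesPartitioning = true)

    val r3 = r2.reduceByKey(_ + _)  // reuses r1's partitioner, so no extra shuffle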

Error in creating log directory

2015-03-26 Thread pzilaro
I get the following error message when I start the pyspark shell. The config has the following settings - # spark.master spark://master:7077 # spark.eventLog.enabled true # spark.eventLog.dir hdfs://namenode:8021/directory # spark.serializer

Re: Spark SQL queries hang forever

2015-03-26 Thread Michael Armbrust
Is it possible to jstack the executors and see where they are hanging? On Thu, Mar 26, 2015 at 2:02 PM, Jon Chase jon.ch...@gmail.com wrote: Spark 1.3.0 on YARN (Amazon EMR), cluster of 10 m3.2xlarge (8cpu, 30GB), executor memory 20GB, driver memory 10GB I'm using Spark SQL, mainly via

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread David Holiday
w0t! that did it! t/y so much! I'm going to put together a pastebin or something that has all the code put together so if anyone else runs into this issue they will have some working code to help them figure out what's going on. DAVID HOLIDAY Software Engineer 760 607 3300 |

WordCount example

2015-03-26 Thread Mohit Anchlia
I am trying to run the word count example but for some reason it's not working as expected. I start the nc server on a port and then submit the spark job to the cluster. The Spark job gets successfully submitted but I never see any connection from Spark getting established. I also tried to type words
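
For reference, a minimal sketch of the streaming word count being described, assuming nc -lk is listening on port 9999 of localhost (host, port and batch interval are illustrative, since the original message does not show them):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999)   // must match the host/port nc listens on
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()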

Spark SQL queries hang forever

2015-03-26 Thread Jon Chase
Spark 1.3.0 on YARN (Amazon EMR), cluster of 10 m3.2xlarge (8cpu, 30GB), executor memory 20GB, driver memory 10GB I'm using Spark SQL, mainly via spark-shell, to query 15GB of data spread out over roughly 2,000 Parquet files and my queries frequently hang. Simple queries like select count(*) from
