Re: Database operations on executor nodes

2015-03-19 Thread Akhil Das
Totally depends on your database. If it's a NoSQL database like MongoDB/HBase etc., then you can use the native .saveAsNewAPIHadoopFile or .saveAsHadoopDataset etc. For SQL databases, I think people usually put the overhead on the driver like you did. Thanks Best Regards On Wed, Mar 18, 2015 at
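A minimal sketch of the NoSQL path Akhil mentions, assuming an HBase 0.98-era client API and a hypothetical table and RDD; saveAsNewAPIHadoopDataset is the PairRDDFunctions method referenced above:

    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapreduce.Job

    val job = Job.getInstance(sc.hadoopConfiguration)
    job.getConfiguration.set(TableOutputFormat.OUTPUT_TABLE, "mytable") // hypothetical table
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    // hypothetical: rdd is an RDD[(String, String)] of (rowkey, value) pairs
    rdd.map { case (key, value) =>
      val put = new Put(Bytes.toBytes(key))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      (new ImmutableBytesWritable, put)
    }.saveAsNewAPIHadoopDataset(job.getConfiguration)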

Re: Spark + Kafka

2015-03-19 Thread James King
Thanks Khanderao. On Wed, Mar 18, 2015 at 7:18 PM, Khanderao Kand Gmail khanderao.k...@gmail.com wrote: I have used various versions of Spark (1.0, 1.2.1) without any issues. Though I have not used Kafka significantly with 1.3.0, preliminary testing revealed no issues. - khanderao

Error while inserting data into Hive table via Spark

2015-03-19 Thread Dhimant
Hi, I have configured Apache Spark 1.3.0 with Hive 1.0.0 and Hadoop 2.6.0. I am able to create tables and retrieve data from Hive tables via the following commands, but I am not able to insert data into a table. scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS newtable (key INT)"); scala> sqlContext.sql("select *
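A hedged sketch of the insert path that typically works on Spark 1.3 with a Hive metastore (the source table name below is hypothetical); Spark SQL of this era supports INSERT via a SELECT rather than literal VALUES:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.sql("CREATE TABLE IF NOT EXISTS newtable (key INT)")
    // Insert by selecting from another registered table; 1.3-era Spark SQL
    // does not accept INSERT INTO ... VALUES literals.
    hiveContext.sql("INSERT INTO TABLE newtable SELECT key FROM sourcetable")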

RE: configure number of cached partitions in memory on SparkSQL

2015-03-19 Thread Judy Nash
Thanks Cheng for replying. I meant to say: change the number of partitions of a cached table, so that it doesn't need to be re-adjusted after caching. To provide more context: what I am seeing on my dataset is that we have a large number of tasks. Since it appears each task is mapped to a partition, I
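One way to control this, sketched here assuming the Spark 1.3 DataFrame API and a hypothetical table name, is to repartition before caching:

    // Repartition to the desired task count, then cache the repartitioned table.
    val df = sqlContext.table("events").repartition(64) // "events" is hypothetical
    df.registerTempTable("events_repart")
    sqlContext.cacheTable("events_repart")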

MLlib Spam example gets stuck in Stage X

2015-03-19 Thread Su She
Hello Everyone, I am trying to run this MLlib example from Learning Spark: https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala#L48 Things I'm doing differently: 1) using the Spark shell instead of an application, 2) instead of

Need some help on the Spark performance on Hadoop Yarn

2015-03-19 Thread Yi Ming Huang
Dear Spark experts, I'd appreciate you looking into my problem and giving me some help and suggestions here... Thank you! I have a simple Spark application to parse and analyze logs, and I can run it on my Hadoop YARN cluster. My problem is that I find it runs quite slowly on the cluster,

Re: MLlib Spam example gets stuck in Stage X

2015-03-19 Thread Akhil Das
Can you see where exactly it is spending time? Like you said, it gets to Stage 2, so you will be able to see how much time it spent on Stage 1. If it's GC time, then try increasing the level of parallelism or repartitioning, e.g. to sc.defaultParallelism * 3. Thanks Best Regards On Thu, Mar
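A one-line sketch of the repartitioning Akhil suggests (the RDD name is hypothetical):

    // Spread the data over more partitions to reduce per-task memory and GC pressure.
    val repartitioned = trainingData.repartition(sc.defaultParallelism * 3)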

Reliable method/tips to solve dependency issues?

2015-03-19 Thread Jim Kleckner
Do people have a reliable/repeatable method or tips for solving dependency issues? The current world of spark-hadoop-hbase-parquet-... is very challenging given the huge footprint of dependent packages, and we may be pushing against the limits of how many packages can be combined into one

Cloudant as Spark SQL External Datastore on Spark 1.3.0

2015-03-19 Thread Yang Lei
Check this out: https://github.com/cloudant/spark-cloudant. It supports both the DataFrame and SQL approaches for reading data from Cloudant and saving it. Looking forward to your feedback on the project. Yang

FetchFailedException: Adjusted frame length exceeds 2147483647: 12716268407 - discarded

2015-03-19 Thread roni
I get 2 types of errors: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0, and FetchFailedException: Adjusted frame length exceeds 2147483647: 12716268407 - discarded. Spark keeps re-trying to submit the code and keeps getting this error. My file on

Catching InvalidClassException in sc.objectFile

2015-03-19 Thread Justin Yip
Hello, I have persisted an RDD[T] to disk through saveAsObjectFile. Then I changed the implementation of T. When I read the file with sc.objectFile using the new binary, I got a java.io.InvalidClassException, which is expected. I tried to catch this error via SparkException in the
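A hedged sketch of catching this on the driver (type and path names are hypothetical); since the deserialization failure happens on executors, it usually surfaces wrapped in a SparkException, so matching on the message is a pragmatic if brittle check:

    try {
      val data = sc.objectFile[MyRecord]("hdfs:///snapshots/v1") // hypothetical
      data.count() // force evaluation; objectFile itself is lazy
    } catch {
      case e: org.apache.spark.SparkException
          if e.getMessage.contains("InvalidClassException") =>
        // fall back, e.g. regenerate the RDD from the raw source
        println("Snapshot is stale; recomputing")
    }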

Re: Upgrade from Spark 1.1.0 to 1.1.1+ Issues

2015-03-19 Thread Eason Hu
Hi Akhil, Thank you for your help. I just found that the problem is related to my local Spark application: I ran it in IntelliJ and didn't reload the project after recompiling the jar via Maven. If I don't reload, it uses some locally cached data to run the application, which leads to

Timeout Issues from Spark 1.2.0+

2015-03-19 Thread EH
Hi all, I'm trying to run the sample Spark application in version v1.2.0 and above. However, I've encountered a weird issue, shown below. This issue is only seen in v1.2.0 and above; v1.1.0 and v1.1.1 are fine. The sample code: val sc: SparkContext = new SparkContext(conf) val

Re: MLlib Spam example gets stuck in Stage X

2015-03-19 Thread Su She
Hi Akhil, 1) How could I see how much time it is spending on stage 1? Or what if, like above, it doesn't get past stage 1? 2) How could I check if it's GC time? And where would I increase the parallelism for the model? I have a Spark master and 2 workers running on CDH 5.3... what would the

how to specify multiple masters in sbin/start-slaves.sh script?

2015-03-19 Thread sequoiadb
Hey guys, Not sure if I'm the only one who got this. We are building a highly-available standalone Spark environment, using ZK with 3 masters in the cluster. However, sbin/start-slaves.sh calls start-slave.sh for each member in the conf/slaves file, and specifies the master using $SPARK_MASTER_IP and

Re: SparkSQL 1.3.0 JDBC data source issues

2015-03-19 Thread Pei-Lun Lee
JIRA and PR for the first issue: https://issues.apache.org/jira/browse/SPARK-6408 https://github.com/apache/spark/pull/5087 On Thu, Mar 19, 2015 at 12:20 PM, Pei-Lun Lee pl...@appier.com wrote: Hi, I am trying the JDBC data source in Spark SQL 1.3.0 and found some issues. First, the syntax where

Re: LZO configuration can not affect

2015-03-19 Thread Ted Yu
How did you generate the Hadoop-lzo jar? Thanks On Mar 17, 2015, at 2:36 AM, 唯我者 878223...@qq.com wrote: hi, everybody: I have configured the env for LZO like this: [two screenshots] But when I execute code with spark-shell

RE: Column Similarity using DIMSUM

2015-03-19 Thread Manish Gupta 8
Hi Reza, Behavior: I tried running the job with different thresholds (0.1, 0.5, 5, 20, 100). Every time, the job got stuck at mapPartitionsWithIndex at RowMatrix.scala:522 (http://del2l379java.sapient.com:8088/proxy/application_1426267549766_0101/stages/stage?id=118&attempt=0) with all

OutOfMemoryError during reduce tasks

2015-03-19 Thread Balazs Meszaros
Hi, I am trying to evaluate performance aspects of Spark with respect to various memory settings. What makes it more difficult is that I'm new to Python, but the problem at hand doesn't seem to originate from that. I'm running a wordcount script [1] with different amounts of input data. There

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Ted Yu
There might be some delay: http://search-hadoop.com/m/JW1q5mjZUy/Spark+people%2527s+responsessubj=Apache+Spark+User+List+people+s+responses+not+showing+in+the+browser+view On Mar 18, 2015, at 4:47 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Thanks, Ted. Well, so far even there

Re: MLlib Spam example gets stuck in Stage X

2015-03-19 Thread Akhil Das
To get these metrics, you need to open the driver UI running on port 4040. There you will see the Stages information, and for each stage you can see how much time it is spending on GC etc. In your case, the parallelism seems to be 4; the higher the parallelism, the more tasks you will see.

Re: Null pointer exception reading Parquet

2015-03-19 Thread Akhil Das
How are you running the application? Can you try running the same inside spark-shell? Thanks Best Regards On Wed, Mar 18, 2015 at 10:51 PM, sprookie cug12...@gmail.com wrote: Hi All, I am using Spark version 1.2 running locally. When I try to read a Parquet file I get the below exception; what

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-19 Thread Paolo Platter
Yes, I would suggest spark-notebook too. It's very simple to set up and it's growing pretty fast. Paolo Sent from my Windows Phone From: Irfan Ahmad ir...@cloudphysics.com Sent: 19/03/2015 04:05 To: davidh dav...@annaisystems.com Cc:

Spark 1.2.0 | Spark job fails with MetadataFetchFailedException

2015-03-19 Thread Aniket Bhatnagar
I have a job that sorts data and runs a combineByKey operation, and it sometimes fails with the following error. The job is running on a Spark 1.2.0 cluster with yarn-client deploy mode. Any clues on how to debug the error? org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output

Reading a text file into RDD[Char] instead of RDD[String]

2015-03-19 Thread Michael Lewis
Hi, I'm struggling to think of the best way to read a text file into an RDD[Char] rather than RDD[String]. I can do sc.textFile(...), which gives me the RDD[String]. Can anyone suggest the most efficient way to create the RDD[Char]? I'm sure I've missed something simple... Regards, Mike

Problems calculating TF-IDF for a large 100GB dataset

2015-03-19 Thread sergunok
Hi, I am trying to vectorize, on a YARN cluster, a corpus of texts (about 500K texts in 13 files, 100GB in total) located in HDFS. This process has already taken about 20 hours on a 3-node cluster with 6 cores and 20GB RAM on each node. In my opinion that's too long :-) I started the task with the following command:
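For reference, a minimal sketch of the standard MLlib 1.2-era TF-IDF pipeline (the HDFS path is hypothetical); caching the term-frequency RDD matters because IDF fitting and transforming each traverse it:

    import org.apache.spark.mllib.feature.{HashingTF, IDF}

    val docs = sc.textFile("hdfs:///corpus").map(_.split(" ").toSeq) // hypothetical path
    val tf = new HashingTF().transform(docs)
    tf.cache() // IDF.fit and idf.transform both pass over tf
    val idf = new IDF().fit(tf)
    val tfidf = idf.transform(tf)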

Re: Does newly-released LDA (Latent Dirichlet Allocation) algorithm supports ngrams?

2015-03-19 Thread Charles Earl
Heszak, I have only glanced at it, but you should be able to incorporate tokens approximating n-grams yourself, say by using the Lucene ShingleAnalyzerWrapper API http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleAnalyzerWrapper.html You might also take a

RE: Spark SQL Self join with aggregate

2015-03-19 Thread Cheng, Hao
Not so sure of your intention, but something like SELECT sum(val1), sum(val2) FROM table GROUP BY src, dest? -----Original Message----- From: Shailesh Birari [mailto:sbirar...@gmail.com] Sent: Friday, March 20, 2015 9:31 AM To: user@spark.apache.org Subject: Spark SQL Self join with aggregate
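A sketch of how that might look from Scala on 1.3, with a hypothetical registered table named events; no self-join is needed when the GROUP BY covers the (src, dst) pair:

    // events has columns: time, src, dst, val1, val2 (hypothetical table)
    val sums = sqlContext.sql(
      "SELECT src, dst, SUM(val1) AS sum1, SUM(val2) AS sum2 FROM events GROUP BY src, dst")
    sums.show()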

Re: LZO configuration can not affect

2015-03-19 Thread Ted Yu
jeanlyn92: I was not very clear in my previous reply: I meant to refer to /home/hadoop/mylib/hadoop-lzo-SNAPSHOT.jar. But it looks like the distro includes hadoop-lzo-0.4.15.jar. Cheers On Thu, Mar 19, 2015 at 6:26 PM, jeanlyn92 jeanly...@gmail.com wrote: That's not enough. The config must point

Spark SQL Self join with aggregate

2015-03-19 Thread Shailesh Birari
Hello, I want to use Spark SQL to aggregate some columns of the data. E.g. I have huge data with columns: time, src, dst, val1, val2. I want to calculate sum(val1) and sum(val2) for all unique pairs of src and dst. I tried forming the SQL query SELECT a.time, a.src, a.dst,

RE: Column Similarity using DIMSUM

2015-03-19 Thread Manish Gupta 8
Thanks Reza. It makes perfect sense. Regards, Manish From: Reza Zadeh [mailto:r...@databricks.com] Sent: Thursday, March 19, 2015 11:58 PM To: Manish Gupta 8 Cc: user@spark.apache.org Subject: Re: Column Similarity using DIMSUM Hi Manish, With 56431 columns, the output can be as large as 56431

Re: Software stack for recommendation engine with Spark MLlib

2015-03-19 Thread Shashidhar Rao
Hi, just 2 follow-up questions, please suggest: 1. Is there any commercial recommendation engine, apart from the open source tools (Mahout, Spark), that anybody can suggest? 2. In this case only the purchase transaction is captured. There are no ratings and no feedback

Re: KMeans with large clusters Java Heap Space

2015-03-19 Thread mvsundaresan
Thanks Derrick, when I count the unique terms it is very small. So I added this... val tfidf_features = lines.flatMap(x => x._2.split(" ").filter(_.length > 2)).distinct().count().toInt val hashingTF = new HashingTF(tfidf_features)

Spark MLLib KMeans Top Terms

2015-03-19 Thread mvsundaresan
I'm trying to cluster short text messages using KMeans. After training the KMeans model, I want to get the top terms (5 - 10) per cluster. How do I get that using clusterCenters? Full code is here: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-with-large-clusters-Java-Heap-Space-td21432.html
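A hedged sketch of reading the top dimensions out of the centers, assuming a trained KMeansModel named model; note that with HashingTF the indices are hash buckets, so a separate term-to-index map is needed to recover the actual words:

    model.clusterCenters.zipWithIndex.foreach { case (center, i) =>
      // Take the 5 largest-weight dimensions of this cluster center.
      val topIndices = center.toArray.zipWithIndex.sortBy(-_._1).take(5).map(_._2)
      println(s"Cluster $i top feature indices: " + topIndices.mkString(", "))
    }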

Re: Can LBFGS be used on streaming data?

2015-03-19 Thread Jeremy Freeman
Regarding the first question: can you say more about how you are loading your data? What is the size of the data set? And is that the only error you see, and do you only see it in the streaming version? For the second question, there are a couple of reasons the weights might slightly differ:

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-19 Thread Bharath Ravi Kumar
Hi Doug, I did try setting that config parameter to a larger number (several minutes), but still wasn't able to retrieve additional context logs. Let us know if you have any success with it. Thanks, Bharath On Fri, Mar 20, 2015 at 3:21 AM, Doug Balog doug.sparku...@dugos.com wrote: I’m seeing

Launching Spark Cluster Application through IDE

2015-03-19 Thread raggy
I am trying to debug a Spark application on a cluster with a master and several worker nodes. I have been successful at setting up the master node and worker nodes using the Spark standalone cluster manager. I downloaded the Spark folder with binaries and used the following commands to set up worker

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-19 Thread David Holiday
kk - I'll put something together and get back to you with more :-) DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com

Re: Issues with SBT and Spark

2015-03-19 Thread Sean Owen
No, Spark is cross-built for 2.11 too, and those are the deps being pulled in here. This really does, however, sound like a Scala 2.10 vs 2.11 mismatch. Check, for example, that your cluster is using the same build of Spark and that you did not package Spark with your app. On Thu, Mar 19, 2015 at

Re: Issues with SBT and Spark

2015-03-19 Thread Masf
Hi, Spark 1.2.1 uses Scala 2.10. Because of this, your program fails with Scala 2.11. Regards On Thu, Mar 19, 2015 at 8:17 PM, Vijayasarathy Kannan kvi...@vt.edu wrote: My current simple.sbt is name := "SparkEpiFast" version := "1.0" scalaVersion := "2.11.4" libraryDependencies +=

Re: Spark SQL filter DataFrame by date?

2015-03-19 Thread Yin Huai
Can you add your code snippet? Seems it's missing in the original email. Thanks, Yin On Thu, Mar 19, 2015 at 3:22 PM, kamatsuoka ken...@gmail.com wrote: I'm trying to filter a DataFrame by a date column, with no luck so far. Here's what I'm doing: When I run reqs_day.count() I get zero,

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Nicholas Chammas
Nabble is a third-party site that tries its best to archive mail sent out over the list. Nothing guarantees it will be in sync with the real mailing list. To get the truth on what was sent over this, Apache-managed list, you unfortunately need to go the Apache archives:

Re: Reading a text file into RDD[Char] instead of RDD[String]

2015-03-19 Thread Sean Owen
val s = sc.parallelize(Array("foo", "bar", "baz")) val c = s.flatMap(_.toIterator) c.collect() res8: Array[Char] = Array(f, o, o, b, a, r, b, a, z) On Thu, Mar 19, 2015 at 8:46 AM, Michael Lewis lewi...@me.com wrote: Hi, I'm struggling to think of the best way to read a text file into an RDD[Char]

Re: LZO configuration can not affect

2015-03-19 Thread Ted Yu
If I read the screenshot correctly, the Hadoop lzo jar is under /home/hadoop/mylib. Cheers On Mar 19, 2015, at 5:37 AM, jeanlyn92 jeanly...@gmail.com wrote: You should configure as follows: export SPARK_LIBRARY_PATH=$HADOOP_HOME/lib/native:$HADOOP_HOME/share/hadoop/common/lib/hadoop-lzo-0.4.15.jar

RE: Why I didn't see the benefits of using KryoSerializer

2015-03-19 Thread java8964
I read the Spark code a little bit, trying to answer my own question. It looks like the difference is really between org.apache.spark.serializer.JavaSerializer and org.apache.spark.serializer.KryoSerializer, both having a method named writeObject. In my test case, for each line of my text
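For context, a minimal sketch of enabling Kryo and registering classes (the registered class is hypothetical); Kryo's gains are usually largest when your own types are registered, since unregistered classes pay for writing the full class name:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyRecord])) // MyRecord is hypothetical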

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-19 Thread Yiannis Gkoufas
Hi Yin, thanks a lot for that! Will give it a shot and let you know. On 19 March 2015 at 16:30, Yin Huai yh...@databricks.com wrote: Was the OOM thrown during the execution of first stage (map) or the second stage (reduce)? If it was the second stage, can you increase the value of

Problems with spark.akka.frameSize

2015-03-19 Thread Vijayasarathy Kannan
Hi, I am encountering the following error with a Spark application. Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 0:0 was 11257268 bytes, which exceeds max allowed: spark.akka.frameSize (10485760 bytes) - reserved (204800 bytes).
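A sketch of the usual workaround, assuming the task really must carry ~11 MB: raise the frame size (the value is in MB) when building the context, though shipping large closures is often better avoided by broadcasting the data instead:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("MyApp") // hypothetical app name
      .set("spark.akka.frameSize", "128") // in MB; the 1.x default is 10
    val sc = new SparkContext(conf)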

Load balancing

2015-03-19 Thread Mohit Anchlia
I am trying to understand how to load balance incoming data across multiple Spark Streaming workers. Could somebody help me understand how I can distribute incoming data from various sources such that it goes to multiple Spark Streaming nodes? Is it done by the Spark client with help

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-19 Thread David Holiday
hi all - thx for the alacritous replies! so regarding how to get things from notebook to spark and back, am I correct that spark-submit is the way to go? DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com

Re: Column Similarity using DIMSUM

2015-03-19 Thread Reza Zadeh
Hi Manish, With 56431 columns, the output can be as large as 56431 x 56431 ~= 3bn entries. When a single row is dense, that can end up overwhelming a machine. You can push that up with more RAM, but note that DIMSUM is meant for tall and skinny matrices: it scales linearly, and across the cluster, with rows,

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Ted Yu
I prefer using search-hadoop.com which provides better search capability. Cheers On Thu, Mar 19, 2015 at 6:48 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Nabble is a third-party site that tries its best to archive mail sent out over the list. Nothing guarantees it will be in sync

Writing Spark Streaming Programs

2015-03-19 Thread James King
Hello All, I'm using Spark for streaming, but I'm unclear on which implementation language to use: Java, Scala or Python. I don't know anything about Python, am familiar with Scala, and have been doing Java for a long time. I think the above shouldn't influence my decision on which language to use

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Nicholas Chammas
Sure, you can use Nabble or search-hadoop or whatever you prefer. My point is just that the source of truth are the Apache archives, and these other sites may or may not be in sync with that truth. On Thu, Mar 19, 2015 at 10:20 AM Ted Yu yuzhih...@gmail.com wrote: I prefer using

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Nicholas Chammas
Yes, that is mostly why these third-party sites have sprung up around the official archives--to provide better search. Did you try the link Ted posted? On Thu, Mar 19, 2015 at 10:49 AM Dmitry Goldenberg dgoldenberg...@gmail.com wrote: It seems that those archives are not necessarily easy to

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Dmitry Goldenberg
It seems that those archives are not necessarily easy to find stuff in. Is there a search engine on top of them, so as to find e.g. your own posts easily? On Thu, Mar 19, 2015 at 10:34 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Sure, you can use Nabble or search-hadoop or whatever

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Dmitry Goldenberg
Interesting points. Yes I just tried http://search-hadoop.com/m/JW1q5mjZUy/Spark+people%2527s+responsessubj=Apache+Spark+User+List+people+s+responses+not+showing+in+the+browser+view and I see responses there now. I believe Ted was right in that, there's a delay before they show up there (probably

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Ted Yu
Here is the reason why results on a search site may be delayed, especially for Apache JIRAs: if they crawl too often, Apache would flag the bot and blacklist it. Cheers On Thu, Mar 19, 2015 at 7:59 AM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Interesting points. Yes I just tried

Re: Writing Spark Streaming Programs

2015-03-19 Thread Gerard Maas
Try writing this Spark Streaming idiom in Java and you'll choose Scala soon enough: dstream.foreachRDD { rdd => rdd.foreachPartition { partition => ... } } When deciding between Java and Scala for Spark, IMHO Scala has the upper hand. If you're concerned with readability, have a look at the Scala
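A slightly fuller version of the idiom, with hypothetical createConnection/send helpers, showing why the partition level is the right place to hold per-executor resources:

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // createConnection and send are hypothetical; the point is that the
        // connection is opened on the executor, not serialized from the driver.
        val conn = createConnection()
        partition.foreach(record => conn.send(record))
        conn.close()
      }
    }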

Re: spark there is no space on the disk

2015-03-19 Thread Davies Liu
Is it possible that `spark.local.dir` is overridden by others? The docs say: NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN). On Sat, Mar 14, 2015 at 5:29 PM, Peng Xia sparkpeng...@gmail.com wrote: Hi Sean, Thanks very much for

Re: Error when using multiple python files spark-submit

2015-03-19 Thread Davies Liu
the options of spark-submit should come before main.py, or they will become the options of main.py, so it should be: ../hadoop/spark-install/bin/spark-submit --py-files /home/poiuytrez/naive.py,/home/poiuytrez/processing.py,/home/poiuytrez/settings.py --master spark://spark-m:7077 main.py

Spark SQL filter DataFrame by date?

2015-03-19 Thread kamatsuoka
I'm trying to filter a DataFrame by a date column, with no luck so far. Here's what I'm doing: When I run reqs_day.count() I get zero, apparently because my date parameter gets translated to 16509. Is this a bug, or am I doing it wrong?
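For reference, a hedged sketch of this kind of filter on a 1.3 DataFrame (the frame, column, and date below are hypothetical); 16509 looks like a date rendered as days since the Unix epoch, which suggests the comparison is happening against the internal representation:

    import java.sql.Date

    // Compare the date column against a java.sql.Date literal.
    val reqs_day = reqs.filter(reqs("date") === Date.valueOf("2015-03-15"))
    reqs_day.count()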

Re: Spark 1.3 createDataframe error with pandas df

2015-03-19 Thread Davies Liu
On Mon, Mar 16, 2015 at 6:23 AM, kevindahl kevin.d...@gmail.com wrote: kevindahl wrote I'm trying to create a spark data frame from a pandas data frame, but for even the most trivial of datasets I get an error along the lines of this:

Spark Streaming custom receiver for local data

2015-03-19 Thread MartijnD
We are building a wrapper that makes it possible to use reactive streams (i.e. Observable, see reactivex.io) as input to Spark Streaming. We therefore tried to create a custom receiver for Spark. However, the Observable lives in the driver program and is generally not serializable. Is it possible
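A minimal custom-receiver sketch for context, under the assumption that the connection to the source is opened inside onStart() on the executor (rather than capturing a driver-side Observable, which would have to be serialized):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // Hypothetical receiver reading lines from a socket source.
    class LineReceiver(host: String, port: Int)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        new Thread("LineReceiver") {
          override def run(): Unit = {
            val socket = new java.net.Socket(host, port) // connect on the executor
            val reader = new java.io.BufferedReader(
              new java.io.InputStreamReader(socket.getInputStream))
            var line = reader.readLine()
            while (!isStopped && line != null) {
              store(line) // hand each record to Spark
              line = reader.readLine()
            }
            socket.close()
          }
        }.start()
      }

      def onStop(): Unit = {} // the reading thread exits once isStopped is true
    }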

Re: Spark-submit and multiple files

2015-03-19 Thread Davies Liu
You could submit additional Python source via --py-files , for example: $ bin/spark-submit --py-files work.py main.py On Tue, Mar 17, 2015 at 3:29 AM, poiuytrez guilla...@databerries.com wrote: Hello guys, I am having a hard time to understand how spark-submit behave with multiple files. I

Issues with SBT and Spark

2015-03-19 Thread Vijayasarathy Kannan
My current simple.sbt is name := "SparkEpiFast" version := "1.0" scalaVersion := "2.11.4" libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.2.1" % "provided" libraryDependencies += "org.apache.spark" % "spark-graphx_2.11" % "1.2.1" % "provided" When I do sbt package, it compiles successfully.

Re: spark there is no space on the disk

2015-03-19 Thread Marcelo Vanzin
IIRC you have to set that configuration on the Worker processes (for standalone). The app can't override it (only for a client-mode driver). YARN has a similar configuration, but I don't know the name (shouldn't be hard to find, though). On Thu, Mar 19, 2015 at 11:56 AM, Davies Liu

Re: spark there is no space on the disk

2015-03-19 Thread Ted Yu
For YARN, possibly this one?

    <property>
      <name>yarn.nodemanager.local-dirs</name>
      <value>/hadoop/yarn/local</value>
    </property>

Cheers On Thu, Mar 19, 2015 at 2:21 PM, Marcelo Vanzin van...@cloudera.com wrote: IIRC you have to set that configuration on the Worker processes (for

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-19 Thread Doug Balog
I’m seeing the same problem. I’ve set logging to DEBUG, and I think some hints are in the “Yarn AM launch context” that is printed out before Yarn runs java. My next step is to talk to the admins and get them to set yarn.nodemanager.delete.debug-delay-sec in the config, as recommended in

Re: Spark + Kafka

2015-03-19 Thread James King
Many thanks all for the good responses, appreciated. On Thu, Mar 19, 2015 at 8:36 AM, James King jakwebin...@gmail.com wrote: Thanks Khanderao. On Wed, Mar 18, 2015 at 7:18 PM, Khanderao Kand Gmail khanderao.k...@gmail.com wrote: I have used various version of spark (1.0, 1.2.1) without

Re: saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Yin Huai
I meant that table properties and SerDe properties are used to store the metadata of a Spark SQL data source table. We do not set other fields like the SerDe lib. For a user, the output of DESCRIBE EXTENDED/FORMATTED on a data source table should not show unrelated stuff like SerDe lib and InputFormat. I have

JAVA_HOME problem with upgrade to 1.3.0

2015-03-19 Thread Williams, Ken
I'm trying to upgrade a Spark project, written in Scala, from Spark 1.2.1 to 1.3.0, so I changed my `build.sbt` like so: -libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % "provided" +libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided" then make an

saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Christian Perez
Hi all, DataFrame.saveAsTable creates a managed table in Hive (v0.13 on CDH5.3.2) in both spark-shell and pyspark, but creates the *wrong* schema _and_ storage format in the Hive metastore, so that the table cannot be read from inside Hive. Spark itself can read the table, but Hive throws a

Re: [SQL] Elasticsearch-hadoop, exception creating temporary table

2015-03-19 Thread Todd Nist
Thanks for the assistance. I found the error; it was something I had done; PEBCAK. I had placed a version of elasticsearch-hadoop-2.1.0.BETA3 in the project/lib directory, causing it to be an unmanaged dependency that was brought in first, even though the build.sbt had the correct version

Re: saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Yin Huai
Hi Christian, Your table is stored correctly in Parquet format. For saveAsTable, the table created is *not* a Hive table, but a Spark SQL data source table ( http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources). We are only using Hive's metastore to store the metadata (to

Re: JAVA_HOME problem with upgrade to 1.3.0

2015-03-19 Thread Ted Yu
JAVA_HOME, an environment variable, should be defined on the node where appattempt_1420225286501_4699_02 ran. Cheers On Thu, Mar 19, 2015 at 8:59 AM, Williams, Ken ken.willi...@windlogics.com wrote: I’m trying to upgrade a Spark project, written in Scala, from Spark 1.2.1 to 1.3.0, so I

Re: Writing Spark Streaming Programs

2015-03-19 Thread James King
Many thanks Gerard, this is very helpful. Cheers! On Thu, Mar 19, 2015 at 4:02 PM, Gerard Maas gerard.m...@gmail.com wrote: Try writing this Spark Streaming idiom in Java and you'll choose Scala soon enough: dstream.foreachRDD{rdd = rdd.foreachPartition( partition = ) } When

Re: Writing Spark Streaming Programs

2015-03-19 Thread Emre Sevinc
Hello James, I've been working with Spark Streaming for the last 6 months, and I'm coding in Java 7. Even though I haven't encountered any blocking issues with that combination, I'd definitely pick Scala if the decision was up to me. I agree with Gerard and Charles on this one. If you can, go

Re: Writing Spark Streaming Programs

2015-03-19 Thread Charles Feduke
Scala is the language used to write Spark so there's never a situation in which features introduced in a newer version of Spark cannot be taken advantage of if you write your code in Scala. (This is mostly true of Java, but it may be a little more legwork if a Java-friendly adapter isn't available

Re: Writing Spark Streaming Programs

2015-03-19 Thread Jeffrey Jedele
I second what has been said already. We just built a streaming app in Java and I would definitely choose Scala this time. Regards, Jeff 2015-03-19 16:34 GMT+01:00 Emre Sevinc emre.sev...@gmail.com: Hello James, I've been working with Spark Streaming for the last 6 months, and I'm coding in

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-19 Thread Yin Huai
Was the OOM thrown during the execution of first stage (map) or the second stage (reduce)? If it was the second stage, can you increase the value of spark.sql.shuffle.partitions and see if the OOM disappears? This setting controls the number of reduces Spark SQL will use and the default is 200.
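A sketch of applying that setting, assuming a SQLContext in hand; 1.3 accepts it via setConf or an equivalent SQL SET command:

    // Raise the number of reduce-side partitions before running the query.
    sqlContext.setConf("spark.sql.shuffle.partitions", "400")
    // Equivalently: sqlContext.sql("SET spark.sql.shuffle.partitions=400")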

Re: saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Christian Perez
Hi Yin, Thanks for the clarification. My first reaction is that if this is the intended behavior, it is a wasted opportunity. Why create a managed table in Hive that cannot be read from inside Hive? I think I understand now that you are essentially piggybacking on Hive's metastore to persist

Re: JAVA_HOME problem with upgrade to 1.3.0

2015-03-19 Thread Williams, Ken
From: Ted Yu yuzhih...@gmail.com Date: Thursday, March 19, 2015 at 11:05 AM JAVA_HOME, an environment variable, should be defined on the node where appattempt_1420225286501_4699_02 ran. Has this behavior changed in 1.3.0 since 1.2.1 though? Using 1.2.1 and