Re: Microsoft SQL jdbc support from spark sql

2015-04-07 Thread Denny Lee
At this time, the JDBC data source is not extensible, so it cannot support SQL Server. There were some thoughts - credit to Cheng Lian for this - about making the JDBC data source extensible for third-party support, possibly via Slick. On Mon, Apr 6, 2015 at 10:41 PM bipin bipin@gmail.com

Regarding benefits of using more than one cpu for a task in spark

2015-04-07 Thread twinkle sachdeva
Hi, In Spark, there are two settings regarding the number of cores: one is at the task level, spark.task.cpus, and the other drives the number of cores per executor, spark.executor.cores. Apart from using more than one core for a task which has to call some other external API etc., is there
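
For reference, a minimal sketch of how these two settings might be set on a SparkConf; the values are illustrative assumptions, not from the thread:
```
import org.apache.spark.SparkConf

// spark.executor.cores / spark.task.cpus = number of tasks that can run
// concurrently inside one executor (here 4 / 2 = 2).
val conf = new SparkConf()
  .set("spark.executor.cores", "4") // cores allocated to each executor
  .set("spark.task.cpus", "2")      // cores reserved for each task
```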

Re: Spark Application Stages and DAG

2015-04-07 Thread Vijay Innamuri
My Spark Streaming application processes the data received in each interval. In the Spark Stages UI, all the stages point to a single line of code, *windowDStream.foreachRDD*, only (not the actions inside the DStream) - Following is the information from the Spark Stages UI page: Stage Id

Re: task not serialize

2015-04-07 Thread Jeetendra Gangele
Let's say I follow the below approach and I get an RddPair with a huge size .. which cannot fit into one machine ... what happens if I run foreach on this RDD? On 7 April 2015 at 04:25, Jeetendra Gangele gangele...@gmail.com wrote: On 7 April 2015 at 04:03, Dean Wampler deanwamp...@gmail.com wrote: On Mon,

Re: Microsoft SQL jdbc support from spark sql

2015-04-07 Thread Bipin Nag
Thanks for the information. Hopefully this will happen in the near future. For now my best bet would be to export the data and import it in Spark SQL. On 7 April 2015 at 11:28, Denny Lee denny.g@gmail.com wrote: At this time, the JDBC data source is not extensible so it cannot support SQL Server.

RE: Spark 1.2.0 with Play/Activator

2015-04-07 Thread Manish Gupta 8
Thanks for the information Andy. I will go through the versions mentioned in Dependencies.scala to identify the compatibility. Regards, Manish From: andy petrella [mailto:andy.petre...@gmail.com] Sent: Tuesday, April 07, 2015 11:04 AM To: Manish Gupta 8; user@spark.apache.org Subject: Re:

RE: Spark 1.2.0 with Play/Activator

2015-04-07 Thread Manish Gupta 8
If I try to build spark-notebook with spark.version=1.2.0-cdh5.3.0, sbt throws these warnings before failing to compile: :: org.apache.spark#spark-yarn_2.10;1.2.0-cdh5.3.0: not found :: org.apache.spark#spark-repl_2.10;1.2.0-cdh5.3.0: not found Any suggestions? Thanks From: Manish Gupta 8

Re: Spark 1.2.0 with Play/Activator

2015-04-07 Thread andy petrella
Mmmh, you want it running a spark 1.2 with hadoop 2.5.0-cdh5.3.2 right? If I'm not wrong you might have to launch it like so: ``` sbt -Dspark.version=1.2.0 -Dhadoop.version=2.5.0-cdh5.3.2 ``` Or you can download it from http://spark-notebook.io if you want. HTH andy On Tue, Apr 7, 2015 at

Re: Strategy regarding maximum number of executor's failure for long running jobs / spark streaming jobs

2015-04-07 Thread twinkle sachdeva
Hi, One of the rationales behind killing the app can be to avoid skewness in the data. I have created this issue (https://issues.apache.org/jira/browse/SPARK-6735) to provide options for disabling this behaviour, as well as making the number of executor failures relative with respect to a

[DAGSchedule][OutputCommitCoordinator] OutputCommitCoordinator.authorizedCommittersByStage Map Out Of Memory

2015-04-07 Thread Tao Li
Hi all: I am using Spark Streaming (1.3.1) as a long-running service and it runs out of memory after running for 7 days. I found that the field *authorizedCommittersByStage* in the *OutputCommitCoordinator* class causes the OOM. authorizedCommittersByStage is a map; the key is StageId, the value is

Can not get executor's Log from Spark's History Server

2015-04-07 Thread donhoff_h
Hi, Experts. I run my Spark cluster on YARN. I used to get executors' logs from Spark's History Server. But after I started my Hadoop job history server and configured log aggregation for Hadoop jobs to an HDFS directory, I found that I could not get Spark's executors' logs any more. Is

Re: Processing Large Images in Spark?

2015-04-07 Thread Steve Loughran
On 6 Apr 2015, at 23:05, Patrick Young patrick.mckendree.yo...@gmail.com wrote: does anyone have any thoughts on storing a really large raster in HDFS? Seems like if I just dump the image into HDFS as is, it'll get stored in blocks all across the

Re: Spark streaming with Kafka - couldn't find KafkaUtils

2015-04-07 Thread Felix C
Or you could build an uber jar ( you could google that ) https://eradiating.wordpress.com/2015/02/15/getting-spark-streaming-on-kafka-to-work/ --- Original Message --- From: Akhil Das ak...@sigmoidanalytics.com Sent: April 4, 2015 11:52 PM To: Priya Ch learnings.chitt...@gmail.com Cc:

[GraphX] aggregateMessages with active set

2015-04-07 Thread James
Hello, The old GraphX API mapReduceTriplets has an optional parameter activeSetOpt: Option[(VertexRDD[_], EdgeDirection)] that limits the input of sendMessage. However, in the new API aggregateMessages I could not find this option. Why is it not offered any more? Alcaid

Re: Processing Large Images in Spark?

2015-04-07 Thread andy petrella
Heya, You might be interested in looking at GeoTrellis. They use RDDs of Tiles to process big images like Landsat ones can be (especially Landsat 8). However, I see you have only 1G per file, so I guess you only care about a single band? Or is it a reboxed pic? Note: I think the GeoTrellis image format is

'Java heap space' error occurred when querying 4G data file from HDFS

2015-04-07 Thread 李铖
In my dev-test env, I have 3 virtual machines; every machine has 12G memory and 8 CPU cores. Here are spark-defaults.conf and spark-env.sh. Maybe some config is not right. I run this command: *spark-submit --master yarn-client --driver-memory 7g --executor-memory 6g /home/hadoop/spark/main.py*

Issue with pyspark 1.3.0, sql package and rows

2015-04-07 Thread Stefano Parmesan
Hi all, I already opened a bug on Jira some days ago [1] but I'm starting to think this is not the correct way to go since I haven't got any news about it yet. Let me try to explain it briefly: with pyspark, trying to cogroup two input files with different schemas leads (nondeterministically)

scala.MatchError: class org.apache.avro.Schema (of class java.lang.Class)

2015-04-07 Thread Yamini
Using Spark (1.2) Streaming to read Avro-schema-based topics flowing in Kafka and then using the Spark SQL context to register the data as a temp table. The Avro Maven plugin (version 1.7.7) generates the Java bean class for the Avro file but includes a field named SCHEMA$ of type org.apache.avro.Schema which is

Difference between textFile Vs hadoopFile (textInputFormat) on HDFS data

2015-04-07 Thread Puneet Kumar Ojha
Hi, Is there any difference between textFile vs hadoopFile (textInputFormat) when data is present in HDFS? Will there be any performance gain that can be observed? Puneet Kumar Ojha Data Architect | PubMatic http://www.pubmatic.com/

Pipelines for controlling workflow

2015-04-07 Thread Staffan
Hi, I am building a pipeline and I've read most of what I can find on the topic (the spark.ml library and the AMPcamp version of pipelines: http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html). I do not have structured data as in the case of the new spark.ml library which

Re: Difference between textFile Vs hadoopFile (textInputFormat) on HDFS data

2015-04-07 Thread Nick Pentreath
There is no difference - textFile calls hadoopFile with a TextInputFormat, and maps each value to a String. On Tue, Apr 7, 2015 at 1:46 PM, Puneet Kumar Ojha puneet.ku...@pubmatic.com wrote: Hi, Is there any difference between textFile vs hadoopFile
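
As a rough illustration of that equivalence (a sketch assuming the Spark 1.x Scala API and a hypothetical HDFS path):
```
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// The two RDDs below contain the same lines; textFile is just the convenience form.
val viaTextFile = sc.textFile("hdfs:///data/logs")
val viaHadoopFile = sc
  .hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/logs")
  .map { case (_, line) => line.toString }
```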

Does spark utilize the sorted order of hbase keys, when using hbase as data source

2015-04-07 Thread Юра
Hello, guys! I am a newbie to Spark and would appreciate any advice or help. Here is the detailed question: http://stackoverflow.com/questions/29493472/does-spark-utilize-the-sorted-order-of-hbase-keys-when-using-hbase-as-data-sour Regards, Yury

Re: Does spark utilize the sorted order of hbase keys, when using hbase as data source

2015-04-07 Thread Юра
There are 500 million distinct users... 2015-04-07 17:45 GMT+03:00 Ted Yu yuzhih...@gmail.com: How many distinct users are stored in HBase? TableInputFormat produces splits where the number of splits matches the number of regions in a table. You can write your own InputFormat which splits

Re: task not serialize

2015-04-07 Thread Dean Wampler
Foreach() runs in parallel across the cluster, like map, flatMap, etc. You'll only run into problems if you call collect(), which brings the entire RDD into memory in the driver program. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do
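
A small sketch of that distinction (the RDD and the side effect are illustrative only):
```
val rdd = sc.parallelize(1 to 1000000)

// Runs on the executors, in parallel; nothing is shipped back to the driver.
rdd.foreach(x => println(s"processed $x")) // output appears in the executor logs

// Pulls the entire RDD into driver memory - only safe for small results.
val all = rdd.collect()
```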

Advice using Spark SQL and Thrift JDBC Server

2015-04-07 Thread James Aley
Hello, First of all, thank you to everyone working on Spark. I've only been using it for a few weeks now but so far I'm really enjoying it. You saved me from a big, scary elephant! :-) I was wondering if anyone might be able to offer some advice about working with the Thrift JDBC server? I'm

Re: Does spark utilize the sorted order of hbase keys, when using hbase as data source

2015-04-07 Thread Ted Yu
How many distinct users are stored in HBase? TableInputFormat produces splits where the number of splits matches the number of regions in a table. You can write your own InputFormat which splits according to user id. FYI On Tue, Apr 7, 2015 at 7:36 AM, Юра rvaniy@gmail.com wrote: Hello,
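
For context, a minimal sketch of reading an HBase table through TableInputFormat, which yields one Spark partition per region; the table name and configuration are illustrative assumptions:
```
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "users") // hypothetical table name

// One partition per HBase region; rows within a region arrive in key order.
val hbaseRdd = sc.newAPIHadoopRDD(hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])
```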

Re: Microsoft SQL jdbc support from spark sql

2015-04-07 Thread Denny Lee
That's correct - at this time MS SQL Server is not supported through the JDBC data source. In my environment, we've been using Hadoop streaming to extract data from multiple SQL Servers, pushing the data into HDFS, creating the Hive tables and/or converting them into Parquet, and

Re: Microsoft SQL jdbc support from spark sql

2015-04-07 Thread ARose
I am having the same issue with my Java application. String url = "jdbc:sqlserver://" + host + ":1433;DatabaseName=" + database + ";integratedSecurity=true"; String driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"; SparkConf conf = new

RE: --driver-memory parameter doesn't work for spark-submit on yarn?

2015-04-07 Thread Shuai Zheng
Sorry for the late reply. I bypassed this by setting _JAVA_OPTIONS. And here is the ps aux | grep spark output: hadoop 14442 0.6 0.2 34334552 128560 pts/0 Sl+ 14:37 0:01 /usr/java/latest/bin/java org.apache.spark.deploy.SparkSubmitDriverBootstrapper --driver-memory=5G --executor-memory=10G --master yarn-client

RE: [BUG]Broadcast value return empty after turn to org.apache.spark.serializer.KryoSerializer

2015-04-07 Thread Shuai Zheng
I have found the issue, but I think it is a bug. If I change my class to: public class ModelSessionBuilder implements Serializable { /** * */ private Properties[] propertiesList; private

Incrementally load big RDD file into Memory

2015-04-07 Thread mas
val locations = filelines.map(line => line.split("\t")).map(t => (t(5).toLong, (t(2).toDouble, t(3).toDouble))).distinct().collect() val cartesienProduct = locations.cartesian(locations).map(t => Edge(t._1._1, t._2._1, distanceAmongPoints(t._1._2._1, t._1._2._2, t._2._2._1, t._2._2._2))) Code executes

Re: Does spark utilize the sorted order of hbase keys, when using hbase as data source

2015-04-07 Thread Ted Yu
Then splitting according to user ids is out of the question :-) On Tue, Apr 7, 2015 at 8:12 AM, Юра rvaniy@gmail.com wrote: There are 500 millions distinct users... 2015-04-07 17:45 GMT+03:00 Ted Yu yuzhih...@gmail.com: How many distinct users are stored in HBase ? TableInputFormat

Re: A problem with Spark 1.3 artifacts

2015-04-07 Thread Marcelo Vanzin
BTW, just out of curiosity, I checked both the 1.3.0 release assembly and the spark-core_2.10 artifact downloaded from http://mvnrepository.com/, and neither contain any references to anything under org.eclipse (all referenced jetty classes are the shaded ones under org.spark-project.jetty). On

The difference between SparkSql/DataFrame join and Rdd join

2015-04-07 Thread Hao Ren
Hi, We have 2 Hive tables and want to join one with the other. Initially, we ran a SQL request on HiveContext, but it did not work. It was blocked at 30/600 tasks. Then we tried to load the tables into two DataFrames, but we encountered the same problem. Finally, it works with RDD.join. What we

Re: task not serialize

2015-04-07 Thread Jeetendra Gangele
I am thinking to follow the below approach (in my class HBase also returns the same object which I will get in the RDD). 1. First run the flatMapToPair: JavaPairRDD<VendorRecord, Iterable<VendorRecord>> pairvendorData = matchRdd.flatMapToPair( new PairFlatMapFunction<VendorRecord, VendorRecord, VendorRecord>(){

Re: RDD collect hangs on large input data

2015-04-07 Thread Jon Chase
Zsolt - what version of Java are you running? On Mon, Mar 30, 2015 at 7:12 AM, Zsolt Tóth toth.zsolt@gmail.com wrote: Thanks for your answer! I don't call .collect because I want to trigger the execution. I call it because I need the rdd on the driver. This is not a huge RDD and it's not

FlatMapPair run for longer time

2015-04-07 Thread Jeetendra Gangele
Hi All, I am running the below code and it is running for a very long time, where the input to flatMapToPair is a set of 50K records, and I am calling HBase 50K times with just a range scan query, which should not take time. Can anybody guide me what is wrong here? JavaPairRDD<VendorRecord, Iterable<VendorRecord>>

Re: Using DIMSUM with ids

2015-04-07 Thread Debasish Das
I have a version that works well for Netflix data but now I am validating on internal datasets... this code will work on matrix factors and sparse matrices that have rows = 100 * columns... if columns are much smaller than rows then the col-based flow works well... basically we need both flows... I did

RE: Incrementally load big RDD file into Memory

2015-04-07 Thread java8964
cartesian is an expensive operation. If you have 'M' records in locations, then locations.cartesian(locations) will generate an MxM result. If locations is a big RDD, it is hard to do locations.cartesian(locations) efficiently. Yong Date: Tue, 7 Apr 2015 10:04:12 -0700 From:
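
To illustrate the blow-up (a toy sketch with hypothetical sizes):
```
// 10,000 locations => cartesian produces 10,000 x 10,000 = 100 million pairs.
val locations = sc.parallelize(1 to 10000)
val pairs = locations.cartesian(locations)
println(pairs.count()) // 100,000,000 - grows quadratically with the input size
```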

Re: FlatMapPair run for longer time

2015-04-07 Thread Dean Wampler
It's hard for us to diagnose your performance problems, because we don't have your environment and fixing one will simply reveal the next one to be fixed. So, I suggest you use the following strategy to figure out what takes the most time and hence what you might try to optimize. Try replacing

Re: The difference between SparkSql/DataFrame join and Rdd join

2015-04-07 Thread Michael Armbrust
The joins here are totally different implementations, but it is worrisome that you are seeing the SQL join hanging. Can you provide more information about the hang? jstack of the driver and a worker that is processing a task would be very useful. On Tue, Apr 7, 2015 at 8:33 AM, Hao Ren

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-07 Thread Michael Armbrust
1) What exactly is the relationship between the thrift server and Hive? I'm guessing Spark is just making use of the Hive metastore to access table definitions, and maybe some other things, is that the case? Underneath the covers, the Spark SQL thrift server is executing queries using a
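
For anyone following along, a minimal sketch of connecting to the Spark SQL Thrift server over JDBC (it speaks the HiveServer2 protocol); the host, port, and table name are illustrative assumptions:
```
import java.sql.DriverManager

// The Thrift server is reached via the HiveServer2 JDBC driver; 10000 is the usual default port.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM some_table")
while (rs.next()) println(rs.getLong(1))
conn.close()
```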

Array[T].distinct doesn't work inside RDD

2015-04-07 Thread anny9699
Hi, I have a question about Array[T].distinct on a customized class T. My data is like RDD[(String, Array[T])] in which T is a class written by myself. There are some duplicates in each Array[T] so I want to remove them. I override the equals() method in T and use val dataNoDuplicates =

Re: scala.MatchError: class org.apache.avro.Schema (of class java.lang.Class)

2015-04-07 Thread Yamini Maddirala
For more details on my question http://apache-spark-user-list.1001560.n3.nabble.com/How-to-generate-Java-bean-class-for-avro-files-using-spark-avro-project-tp22413.html Thanks, Yamini On Tue, Apr 7, 2015 at 2:23 PM, Yamini Maddirala yamini.m...@gmail.com wrote: Hi Michael, Yes, I did try

Re: [GraphX] aggregateMessages with active set

2015-04-07 Thread Ankur Dave
We thought it would be better to simplify the interface, since the active set is a performance optimization but the result is identical to calling subgraph before aggregateMessages. The active set option is still there in the package-private method aggregateMessagesWithActiveSet. You can actually
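
A rough sketch of the equivalent public-API pattern described above (restrict the graph first, then aggregate); the input graph and the active vertex set are hypothetical:
```
import org.apache.spark.graphx._

// Assume `graph: Graph[Int, Int]` already exists.
val activeIds: Set[VertexId] = Set(1L, 2L, 3L)    // hypothetical active vertices
val activeGraph = graph.subgraph(vpred = (id, _) => activeIds.contains(id))

val degrees: VertexRDD[Int] = activeGraph.aggregateMessages[Int](
  ctx => { ctx.sendToSrc(1); ctx.sendToDst(1) },  // one message per edge endpoint
  _ + _                                           // sum messages per vertex
)
```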

Re: scala.MatchError: class org.apache.avro.Schema (of class java.lang.Class)

2015-04-07 Thread Michael Armbrust
Have you looked at spark-avro? https://github.com/databricks/spark-avro On Tue, Apr 7, 2015 at 3:57 AM, Yamini yamini.m...@gmail.com wrote: Using spark(1.2) streaming to read avro schema based topics flowing in kafka and then using spark sql context to register data as temp table. Avro maven

Re: Can not get executor's Log from Spark's History Server

2015-04-07 Thread Marcelo Vanzin
The Spark history server does not have the ability to serve executor logs currently. You need to use the yarn logs command for that. On Tue, Apr 7, 2015 at 2:51 AM, donhoff_h 165612...@qq.com wrote: Hi, Experts I run my Spark Cluster on Yarn. I used to get executors' Logs from Spark's History
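
For reference, fetching the aggregated logs of a finished YARN application is typically done with the yarn CLI (the application id below is a placeholder):
```
yarn logs -applicationId application_1428400000000_0001
```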

Re: scala.MatchError: class org.apache.avro.Schema (of class java.lang.Class)

2015-04-07 Thread Yamini Maddirala
Hi Michael, Yes, I did try the spark-avro 0.2.0 Databricks project. I am using CDH5.3 which is based on Spark 1.2. Hence I'm bound to use spark-avro 0.2.0 instead of the latest. I'm not sure how the spark-avro project can help me in this scenario. 1. I have a JavaDStream of type Avro generic record

How to get SparkSql results on a webpage on real time

2015-04-07 Thread Mukund Ranjan (muranjan)
Hi, I have written a Scala object which can query the messages which I am receiving from Kafka. Now I have to show it on some webpage or dashboard which can auto-refresh with new results. Any pointers on how I can do that? Thanks, Mukund

How to generate Java bean class for avro files using spark avro project

2015-04-07 Thread Yamini
Is there a way to generate a Java bean for a given Avro schema file in Spark 1.2 using spark-avro project 0.2.0 for the following use case? 1. Topics from Kafka are read and stored in the form of Avro generic records: JavaDStream<GenericRecord> 2. Using the spark-avro project I am able to get the schema in the

Re: From DataFrame to LabeledPoint

2015-04-07 Thread Sergio Jiménez Barrio
Solved! Thanks for your help. I converted the null values to a Double value (0.0). On 06/04/2015 19:25, Joseph Bradley jos...@databricks.com wrote: I'd make sure you're selecting the correct columns. If not that, then your input data might be corrupt. CCing user to keep it on the user list.

Re: A problem with Spark 1.3 artifacts

2015-04-07 Thread Marcelo Vanzin
Maybe you have some sbt-built 1.3 version in your ~/.ivy2/ directory that's masking the maven one? That's the only explanation I can come up with... On Tue, Apr 7, 2015 at 12:22 PM, Jacek Lewandowski jacek.lewandow...@datastax.com wrote: So weird, as I said - I created a new empty project

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-07 Thread Michael Armbrust
That should totally work. The other option would be to run a persistent metastore that multiple contexts can talk to and periodically run a job that creates missing tables. The trade-off here would be more complexity, but less downtime due to the server restarting. On Tue, Apr 7, 2015 at 12:34

set spark.storage.memoryFraction to 0 when no cached RDD and memory area for broadcast value?

2015-04-07 Thread Shuai Zheng
Hi All, I am a bit confused about spark.storage.memoryFraction. This is used to set the area for RDD usage; does this mean only cached and persisted RDDs? So if my program has no cached RDD at all (meaning that I have no .cache() or .persist() call on any RDD), then I can set this

Re: A problem with Spark 1.3 artifacts

2015-04-07 Thread Jacek Lewandowski
So weird, as I said - I created a new empty project where Spark core was the only dependency... http://www.datastax.com/ JACEK LEWANDOWSKI DSE Software Engineer | +48.609.810.774 | jacek.lewandow...@datastax.com

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-07 Thread James Aley
Hi Michael, Thanks so much for the reply - that really cleared a lot of things up for me! Let me just check that I've interpreted one of your suggestions for (4) correctly... Would it make sense for me to write a small wrapper app that pulls in hive-thriftserver as a dependency, iterates my

broken link on Spark Programming Guide

2015-04-07 Thread jonathangreenleaf
in the current Programming Guide: https://spark.apache.org/docs/1.3.0/programming-guide.html#actions under Actions, the Python link goes to: https://spark.apache.org/docs/1.3.0/api/python/pyspark.rdd.RDD-class.html which is a 404, and which I think should be:

Re: broken link on Spark Programming Guide

2015-04-07 Thread Ted Yu
For the last link, you might have meant: https://spark.apache.org/docs/1.3.0/api/python/pyspark.html#pyspark.RDD Cheers On Tue, Apr 7, 2015 at 1:32 PM, jonathangreenleaf jonathangreenl...@gmail.com wrote: in the current Programming Guide:

Re: 'Java heap space' error occurred when querying 4G data file from HDFS

2015-04-07 Thread 李铖
Any help, please? Help me get the configuration right. 李铖 lidali...@gmail.com wrote on Tuesday, 7 April 2015: In my dev-test env, I have 3 virtual machines; every machine has 12G memory and 8 CPU cores. Here are spark-defaults.conf and spark-env.sh. Maybe some config is not right. I run this command: *spark-submit

parquet partition discovery

2015-04-07 Thread Christopher Petro
I was unable to get this feature to work in 1.3.0. I tried building off master and it still wasn't working for me. So I dug into the code, and I'm not sure how parsePartition() was ever working. The while loop which walks up the parent directories in the path always terminates after a

RE: 'Java heap space' error occurred when querying 4G data file from HDFS

2015-04-07 Thread java8964
It is hard to guess why the OOM happens without knowing your application's logic and the data size. Without knowing that, I can only guess based on some common experiences: 1) Increase spark.default.parallelism 2) Increase your executor-memory; maybe 6g is just not enough 3) Your environment is kind

Re: 'Java heap space' error occurred when querying 4G data file from HDFS

2015-04-07 Thread Ted Yu
李铖: w.r.t. #5, you can use --executor-cores when invoking spark-submit Cheers On Tue, Apr 7, 2015 at 2:35 PM, java8964 java8...@hotmail.com wrote: It is hard to guess why OOM happens without knowing your application's logic and the data size. Without knowing that, I can only guess based on
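
Putting those suggestions together, a tuned submit might look roughly like this (the values are illustrative, not a recommendation for this particular job):
```
spark-submit --master yarn-client \
  --driver-memory 7g \
  --executor-memory 8g \
  --executor-cores 4 \
  --conf spark.default.parallelism=200 \
  /home/hadoop/spark/main.py
```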

Re: Array[T].distinct doesn't work inside RDD

2015-04-07 Thread Anny Chen
Hi Sean, I didn't override hashCode. But the problem is that Array[T].toSet works but Array[T].distinct doesn't. If it is because I didn't override hashCode, then toSet shouldn't work either, right? I also tried using this Array[T].distinct outside the RDD, and it is working alright also,

How to use Joda Time with Spark SQL?

2015-04-07 Thread adamgerst
I've been using Joda Time in all my Spark jobs (by using the nscala-time package) and have not run into any issues until I started trying to use Spark SQL. When I try to convert a case class that has a com.github.nscala_time.time.Imports.DateTime object in it, an exception is thrown with a

Unable to run spark examples on cloudera

2015-04-07 Thread Georgi Knox
Hi there, We've just started to trial Spark at Bitly. We are running Spark 1.2.1 on Cloudera (CDH-5.3.0) with Hadoop 2.5.0 and are running into issues even just trying to run the Python examples. It's just being run in standalone mode, I believe. $ ./bin/spark-submit --driver-memory 2g

Re: ML consumption time based on data volume - same cluster

2015-04-07 Thread Xiangrui Meng
This could be empirically verified in spark-perf: https://github.com/databricks/spark-perf. Theoretically, it would be 2x for k-means and logistic regression, because computation is doubled but communication cost remains the same. -Xiangrui On Tue, Apr 7, 2015 at 7:15 AM, Vasyl Harasymiv

Re: Job submission API

2015-04-07 Thread Veena Basavaraj
The following might be helpful. http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/What-dependencies-to-submit-Spark-jobs-programmatically-not-via/td-p/24721 http://blog.sequenceiq.com/blog/2014/08/22/spark-submit-in-java/ On 7 April 2015 at 16:32, michal.klo...@gmail.com

Specifying Spark property from command line?

2015-04-07 Thread Arun Lists
Hi, Is it possible to specify a Spark property like spark.local.dir from the command line when running an application using spark-submit? Thanks, arun

Job submission API

2015-04-07 Thread Prashant Kommireddi
Hello folks, Newbie here! Just had a quick question - is there a job submission API such as the one with Hadoop https://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapreduce/Job.html#submit() to submit Spark jobs to a YARN cluster? I see in the examples that bin/spark-submit is what's out

Re: Job submission API

2015-04-07 Thread michal.klo...@gmail.com
A SparkContext can submit jobs remotely. The spark-submit options in general can be populated into a SparkConf and passed in when you create a SparkContext. We personally have not had too much success with yarn-client remote submission, but standalone cluster mode was easy to get going. M
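
A minimal sketch of that idea - populating submit-style options into a SparkConf and creating the context programmatically; the master URL and jar path are placeholders:
```
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("programmatic-submit")
  .setMaster("spark://master-host:7077")        // standalone master URL (placeholder)
  .set("spark.executor.memory", "4g")
  .setJars(Seq("/path/to/my-app-assembly.jar")) // ship the application jar to executors

val sc = new SparkContext(conf)
println(sc.parallelize(1 to 100).sum())
sc.stop()
```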

Re: ML consumption time based on data volume - same cluster

2015-04-07 Thread Vasyl Harasymiv
Thank you Xiangrui. Indeed, however, if the computation involves a matrix, even locally, as in random forest, then if the data increases 2x, even the local computation time should increase 2x. But I will test it with spark-perf and let you know! On Tue, Apr 7, 2015 at 4:50 PM, Xiangrui Meng

Re: Job submission API

2015-04-07 Thread HARIPRIYA AYYALASOMAYAJULA
Hello, If you are looking for the command to submit, the following command works: spark-submit --class SampleTest --master yarn-cluster --num-executors 4 --executor-cores 2 /home/priya/Spark/Func1/target/scala-2.10/simple-project_2.10-1.0.jar On Tue, Apr 7, 2015 at 6:36 PM, Veena Basavaraj

Re: DataFrame groupBy MapType

2015-04-07 Thread Justin Yip
Thanks Michael. Will submit a ticket. Justin On Mon, Apr 6, 2015 at 1:53 PM, Michael Armbrust mich...@databricks.com wrote: I'll add that I don't think there is a convenient way to do this in the Column API ATM, but would welcome a JIRA for adding it :) On Mon, Apr 6, 2015 at 1:45 PM,

Error when running Spark on Windows 8.1

2015-04-07 Thread Arun Lists
Hi, We are trying to run a Spark application using spark-submit on Windows 8.1. The application runs successfully to completion on MacOS 10.10 and on Ubuntu Linux. On Windows, we get the following error messages (see below). It appears that Spark is trying to delete some temporary directory that

Re: Specifying Spark property from command line?

2015-04-07 Thread Arun Lists
I just figured this out from the documentation: --conf spark.local.dir=C:\Temp On Tue, Apr 7, 2015 at 5:00 PM, Arun Lists lists.a...@gmail.com wrote: Hi, Is it possible to specify a Spark property like spark.local.dir from the command line when running an application using spark-submit?

Drools in Spark

2015-04-07 Thread Sathish Kumaran Vairavelu
Hello, Just want to check if anyone has tried drools with Spark? Please let me know. Are there any alternate rule engine that works well with Spark? Thanks Sathish

Expected behavior for DataFrame.unionAll

2015-04-07 Thread Justin Yip
Hello, I am experimenting with DataFrame. I tried to construct two DataFrames with: 1. case class A(a: Int, b: String) scala> adf.printSchema() root |-- a: integer (nullable = false) |-- b: string (nullable = true) 2. case class B(a: String, c: Int) scala> bdf.printSchema() root |-- a: string

Re: Array[T].distinct doesn't work inside RDD

2015-04-07 Thread Sean Owen
I suppose it depends a lot on the implementations. In general, distinct and toSet work when hashCode and equals are defined correctly. When that isn't the case, the result isn't defined; it might happen to work in some cases. This could well explain why you see different results. Why not implement
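
As an illustration of the equals/hashCode point (a sketch with a made-up class; in Scala a case class generates both methods for you):
```
// Hand-written class: equals AND hashCode must agree for distinct/toSet to behave.
class Point(val x: Int, val y: Int) {
  override def equals(other: Any): Boolean = other match {
    case p: Point => p.x == x && p.y == y
    case _        => false
  }
  override def hashCode: Int = 31 * x + y
}

val deduped = Array(new Point(1, 2), new Point(1, 2)).distinct // length 1

// Equivalent with far less ceremony:
case class PointCC(x: Int, y: Int)
```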

Timeout errors from Akka in Spark 1.2.1

2015-04-07 Thread Nikunj Bansal
I have a standalone and local Spark streaming process where we are reading inputs using FlumeUtils. Our longest window size is 6 hours. After about a day and a half of running without any issues, we start seeing Timeout errors while cleaning up input blocks. This seems to cause reading from Flume

Re: Spark TeraSort source request

2015-04-07 Thread Pramod Biligiri
+1. I would love to have the code for this as well. Pramod On Fri, Apr 3, 2015 at 12:47 PM, Tom thubregt...@gmail.com wrote: Hi all, As we all know, Spark has set the record for sorting data, as published on: https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html. Here at our

Re: broken link on Spark Programming Guide

2015-04-07 Thread Sean Owen
I fixed this a while ago in master. It should go out with the next release and next push of the site. On Tue, Apr 7, 2015 at 4:32 PM, jonathangreenleaf jonathangreenl...@gmail.com wrote: in the current Programming Guide: https://spark.apache.org/docs/1.3.0/programming-guide.html#actions under

HiveThriftServer2

2015-04-07 Thread Mohammed Guller
Hi - I want to create an instance of HiveThriftServer2 in my Scala application, so I imported the following line: import org.apache.spark.sql.hive.thriftserver._ However, when I compile the code, I get the following error: object thriftserver is not a member of package

Re: broken link on Spark Programming Guide

2015-04-07 Thread Jonathan Greenleaf
Awesome. thank you! On Apr 7, 2015 8:55 PM, Sean Owen so...@cloudera.com wrote: I fixed this a while ago in master. It should go out with the next release and next push of the site. On Tue, Apr 7, 2015 at 4:32 PM, jonathangreenleaf jonathangreenl...@gmail.com wrote: in the current

value reduceByKeyAndWindow is not a member of org.apache.spark.streaming.dstream.DStream[(String, Int)]

2015-04-07 Thread Su She
Hello Everyone, I am trying to implement this example (Spark Streaming with Twitter). https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala I am able to do: hashTags.print() to get a live stream of filtered hashtags,
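
One frequent cause of this compile error on older Spark versions is that the pair-DStream operations arrive via implicit conversions, so the import matters; a hedged sketch of the usual shape, where hashTags comes from the linked example and the window lengths are illustrative:
```
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext._ // pair DStream ops on pre-1.3 Spark

val counts = hashTags.map(tag => (tag, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10)) // 60s window, sliding every 10s
```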

Re: Spark + Kinesis

2015-04-07 Thread Vadim Bichutskiy
Hey y'all, While I haven't been able to get Spark + Kinesis integration working, I pivoted to plan B: I now push data to S3 where I set up a DStream to monitor an S3 bucket with textFileStream, and that works great. I <3 Spark! Best, Vadim On Mon, Apr 6, 2015 at 12:23 PM, Vadim Bichutskiy

DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Justin Yip
Hello, I have a Parquet file of around 55M rows (~1G on disk). Performing a simple grouping operation is pretty efficient (I get results within 10 seconds). However, after calling DataFrame.cache, I observe a significant performance degradation; the same operation now takes 3+ minutes. My hunch is

Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Yin Huai
Hi Justin, Does the schema of your data have any decimal, array, map, or struct type? Thanks, Yin On Tue, Apr 7, 2015 at 6:31 PM, Justin Yip yipjus...@prediction.io wrote: Hello, I have a parquet file of around 55M rows (~ 1G on disk). Performing simple grouping operation is pretty

Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Justin Yip
The schema has a StructType. Justin On Tue, Apr 7, 2015 at 6:58 PM, Yin Huai yh...@databricks.com wrote: Hi Justin, Does the schema of your data have any decimal, array, map, or struct type? Thanks, Yin On Tue, Apr 7, 2015 at 6:31 PM, Justin Yip yipjus...@prediction.io wrote: Hello,

Cannot change the memory of workers

2015-04-07 Thread Jia Yu
Hi guys, Currently I am running a Spark program on Amazon EC2. Each worker has around (less than but near to) 2 GB memory. By default, I can see each worker is allocated 976 MB memory, as the table below shows on the Spark web UI. I know this value comes from (total memory minus 1 GB). But I want more
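
In standalone mode the figure a worker advertises is usually driven by SPARK_WORKER_MEMORY; a sketch of overriding it in conf/spark-env.sh (the 1500m value is only an illustration for a ~2 GB machine):
```
# conf/spark-env.sh on each worker node
export SPARK_WORKER_MEMORY=1500m   # memory the worker offers to its executors
```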

Caching and Actions

2015-04-07 Thread spark_user_2015
I understand that RDDs are not created until an action is called. Is it a correct conclusion that it doesn't matter if .cache is used anywhere in the program if I only have one action that is called only once? Related to this question, consider this situation: val d1 = data.map((x,y,z) => (x,y))
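
For the surrounding discussion, a small sketch of when cache() actually pays off - only when an RDD feeds more than one action (the input RDD here is hypothetical):
```
val d1 = data.map { case (x, y, z) => (x, y) }.cache()

val n = d1.count()      // first action: computes d1 and caches its partitions
val sample = d1.take(5) // second action: served from the cache, no recomputation
// With a single action executed only once, the cache() call never gets reused.
```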

Unable to specify multiple directories as input

2015-04-07 Thread ๏̯͡๏
Hello, I have two HDFS directories, each containing multiple Avro files. I want to specify these two directories as input. In the Hadoop world, one can specify a list of comma-separated directories. In Spark that does not work. Logs: 15/04/07 21:10:11 INFO storage.BlockManagerMaster: Updated info
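
One workaround that stays inside the public API is to load each directory separately and union the results; a sketch with placeholder paths and a plain-text loader (the same pattern applies to Avro or other Hadoop input formats):
```
// Load each directory as its own RDD, then combine them.
val dirs = Seq("hdfs:///data/avro/day1", "hdfs:///data/avro/day2")
val combined = sc.union(dirs.map(dir => sc.textFile(dir)))
```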

Re: Unable to specify multiple directories as input

2015-04-07 Thread ๏̯͡๏
Spark Version 1.3 Command: ./bin/spark-submit -v --master yarn-cluster --driver-class-path

Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Justin Yip
Thanks for the explanation Yin. Justin On Tue, Apr 7, 2015 at 7:36 PM, Yin Huai yh...@databricks.com wrote: I think the slowness is caused by the way that we serialize/deserialize the value of a complex type. I have opened https://issues.apache.org/jira/browse/SPARK-6759 to track the