Re: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-06 Thread Felix Cheung
Is it possible that your user does not have permission to write the temp file? On Tue, Oct 6, 2015 at 10:26 AM -0700, "akhandeshi" wrote: It seems it is failing at path <- tempfile(pattern = "backend_port") I do not see the backend_port directory created...

Re: Help needed to reproduce bug

2015-10-06 Thread Jean-Baptiste Onofré
Hi Nick, I will try to reproduce your issue on a couple of environments. I just wanted to know which kind of environment you use: Spark standalone, Spark on YARN, or Spark on Mesos? For you, does it occur with any transform() on any RDD, or do you use a specific RDD? I plan to use your code in a main

Help needed to reproduce bug

2015-10-06 Thread pnpritchard
Hi spark community, I was hoping someone could help me by running the code snippet below in the spark shell and seeing if they see the same buggy behavior I do. Full details of the bug can be found in this JIRA issue I filed: https://issues.apache.org/jira/browse/SPARK-10942. The issue was

Re: GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-06 Thread Hemant Bhanawat
One approach is to wrap your MutableRow in WrappedInternalRow, which is a child class of Row. Hemant www.snappydata.io linkedin.com/company/snappydata On Tue, Oct 6, 2015 at 3:21 PM, Ophir Cohen wrote: > Hi Guys, > I'm upgrading to Spark 1.5. > > In our previous version

Re: Notification on Spark Streaming job failure

2015-10-06 Thread Krzysztof Zarzycki
Hi Vikram, So you gave up on using yarn-cluster mode for launching Spark jobs, is that right? AFAIK when using yarn-cluster mode, the launch process (spark-submit) monitors the job running on YARN, but if it is killed/dies, it just stops printing the state (usually RUNNING), without influencing the

Re: Notification on Spark Streaming job failure

2015-10-06 Thread Vikram Kone
We are using Monit to kick off spark streaming jobs and it seems to work fine. On Monday, September 28, 2015, Chen Song wrote: > I am also interested specifically in monitoring and alerting on Spark > streaming jobs. It will be helpful to get some general guidelines or advice

Re: Broadcast var is null

2015-10-06 Thread Nick Peterson
This might seem silly, but... Stop having your object extend App, and instead give it a main method. That's worked for me recently when I've had this issue. (There was a very old issue in Spark related to this; it would seem like a possible regression, if this fixes it for you.) -- Nick On Tue,

Re: Can KafkaCluster be public?

2015-10-06 Thread Ted Yu
Or maybe annotate with @DeveloperApi Cheers On Tue, Oct 6, 2015 at 7:24 AM, Cody Koeninger wrote: > I personally think KafkaCluster (or the equivalent) should be made > public. When I'm deploying spark I just sed out the private[spark] and > rebuild. > > There's a general

Re: Can KafkaCluster be public?

2015-10-06 Thread Sean Owen
For what it's worth, I also use this class in an app, but it happens to be from Java code where it acts as if it's public. So no problem for my use case, but I suppose, another small vote for the usefulness of this class to the caller. I end up using getLatestLeaderOffsets to figure out how to

RE: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-06 Thread Khandeshi, Ami
> Sys.setenv(SPARKR_SUBMIT_ARGS="--verbose sparkr-shell") > Sys.setenv(SPARK_PRINT_LAUNCH_COMMAND=1) > > sc <- sparkR.init(master="local") Launching java with spark-submit command /C/DevTools/spark-1.5.1/bin/spark-submit.cmd --verbose sparkr-shell

Re: Broadcast var is null

2015-10-06 Thread Sean Owen
Yes, see https://issues.apache.org/jira/browse/SPARK-4170 The reason was kind of complicated, and the 'fix' was just to warn you against subclassing App! Yes, use a main() method. On Tue, Oct 6, 2015 at 3:15 PM, Nick Peterson wrote: > This might seem silly, but... > >
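A minimal sketch of the fix described in this thread (class and values are illustrative, not the original poster's code):

  import org.apache.spark.{SparkConf, SparkContext}

  object BroadcastTest {  // a plain object with main(), not "extends App"
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("BroadcastTest"))
      val broadcastVar = sc.broadcast(Map("answer" -> 42))
      // With "extends App" (SPARK-4170), broadcastVar could come back null on the workers.
      sc.parallelize(1 to 10).map(i => broadcastVar.value("answer") + i).collect()
      sc.stop()
    }
  }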

Trying PCA on Spark but a serialization error is thrown

2015-10-06 Thread Simon Hebert
Hi, I tried to use the PCA object in one of my projects but ended up receiving a serialization error. Any help would be appreciated. The example is taken from https://spark.apache.org/docs/latest/mllib-feature-extraction.html#pca My Code: val selector = new PCA(20) val transformer =

Re: [Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2015-10-06 Thread Steve Loughran
On 6 Oct 2015, at 01:23, Andrew Or wrote: Both the history server and the shuffle service are backward compatible, but not forward compatible. This means as long as you have the latest version of history server / shuffle service running in

Re: Can KafkaCluster be public?

2015-10-06 Thread Cody Koeninger
I personally think KafkaCluster (or the equivalent) should be made public. When I'm deploying spark I just sed out the private[spark] and rebuild. There's a general reluctance to make things public due to backwards compatibility, but if enough people ask for it... ? On Tue, Oct 6, 2015 at 6:51

Re: Spark job workflow engine recommendations

2015-10-06 Thread Vikram Kone
Does Azkaban support scheduling long-running jobs like Spark streaming jobs? Will Azkaban kill a job if it's running for a long time? On Friday, August 7, 2015, Vikram Kone wrote: > Hien, > Is Azkaban being phased out at linkedin as rumored? If so, what's linkedin > going

Deep learning example using spark

2015-10-06 Thread Angel Angel
Hello Sir/Madam, I am working on deep learning using Spark. I have implemented some algorithms using Spark, but now I want to use the ImageNet database in spark-1.0.4. Can you give me some guidelines or a reference so I can handle the ImageNet database? Thanking You, Sagar Jadhav.

Re: How to avoid Spark shuffle spill memory?

2015-10-06 Thread David Mitchell
Hi unk1102, Try adding more memory to your nodes. Are you running Spark in the cloud? If so, increase the memory on your servers. Do you have default parallelism set (spark.default.parallelism)? If so, unset it, and let Spark decide how many partitions to allocate. You can also try refactoring
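A hedged sketch of the partitioning side of that advice (the input path and partition count are arbitrary examples, not recommendations):

  val rdd = sc.textFile("hdfs:///some/input")  // hypothetical input
  // More partitions mean each task shuffles less data, so it spills less:
  val counts = rdd.map(line => (line.take(8), 1L)).reduceByKey(_ + _, 400)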

Re: Broadcast var is null

2015-10-06 Thread dpristin
This advice solved the problem: "Stop having your object extend App, and instead give it a main method." https://issues.apache.org/jira/browse/SPARK-4170

Spark cache memory storage

2015-10-06 Thread Lan Jiang
Hi there, My understanding is that the cache storage is calculated as follows: executor heap size * spark.storage.safetyFraction * spark.storage.memoryFraction. The default value for safetyFraction is 0.9 and for memoryFraction is 0.6. When I started a spark job on YARN, I set executor-memory to
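As a worked instance of that formula: with executor-memory set to 10g and the defaults above, the storage pool comes out to 10 GB * 0.9 * 0.6 = 5.4 GB per executor; the rest of the heap is left for task execution.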

Re: How can I disable logging when running local[*]?

2015-10-06 Thread Jeff Jones
Here’s an example. I echoed JAVA_OPTS so that you can see what I’ve got. Then I call ‘activator run’ in the project directory. jjones-mac:analyzer-perf jjones$ echo $JAVA_OPTS -Xmx4g -Xmx4g

Re: compatibility issue with Jersey2

2015-10-06 Thread Marcelo Vanzin
On Tue, Oct 6, 2015 at 12:04 PM, Gary Ogden wrote: > But we run unit tests differently in our build environment, which is > throwing the error. It's setup like this: > > I suspect this is what you were referring to when you said I have a problem? Yes, that is what I was

Re: How can I disable logging when running local[*]?

2015-10-06 Thread Alexander Pivovarov
The easiest way to control logging in spark shell is to run Logger.setLevel commands at the beginning of your program e.g. org.apache.log4j.Logger.getLogger("com.amazon").setLevel(org.apache.log4j.Level.WARN) org.apache.log4j.Logger.getLogger("com.amazonaws").setLevel(org.apache.log4j.Level.WARN)

Re: ORC files created by Spark job can't be accessed using hive table

2015-10-06 Thread Umesh Kacha
Thanks Michael. So the following code, written using Spark 1.5.1, should create files that a Hive table can recognise, right? dataFrame.write().mode(SaveMode.Append).partitionBy("entity","date").format("orc").save("baseTable"); Hive console: Create external table bla bla stored as ORC Location

Re: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-06 Thread Gerard Maas
> performance pattern we cannot explain. >> In alternating fashion, one task takes about 1 second to finish and the >> next takes 7sec for a stable streaming rate. >> >> Here are comparable metrics for two successive tasks: >> *Slow*:

Re: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-06 Thread Cody Koeninger
to finish and the > next takes 7sec for a stable streaming rate. > > Here are comparable metrics for two successive tasks: > *Slow*: > > Executor ID | Address | Task Time | Total Tasks | Failed Tasks | Succeeded Tasks > 20151006-044141-2408867082-5050-21047-S0 | dnode-3.hdfs.

Re: compatibility issue with Jersey2

2015-10-06 Thread Gary Ogden
In our separate environments we run it with spark-submit, so I can give that a try. But we run unit tests differently in our build environment, which is throwing the error. It's set up like this: helper = new CassandraHelper(settings.getCassandra().get()); SparkConf sparkConf =

How to avoid Spark shuffle spill memory?

2015-10-06 Thread unk1102
Hi, I have a Spark job which runs for around 4 hours; it uses a shared SparkContext and runs many child jobs. When I look at each job in the UI I see a shuffle spill of around 30 to 40 GB, and because of that executors often get lost for using physical memory beyond limits. How do I avoid shuffle

Re: spark-submit --packages using different resolver

2015-10-06 Thread Jerry Lam
This is the ticket SPARK-10951 Cheers~ On Tue, Oct 6, 2015 at 11:33 AM, Jerry Lam wrote: > Hi Burak, > > Thank you for the tip. > Unfortunately it does not work. It throws: > > java.net.MalformedURLException: unknown

Re: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-06 Thread Cody Koeninger
>>> performance pattern we cannot explain. >>> In alternating fashion, one task takes about 1 second to finish and the >>> next takes 7sec for a stable streaming rate. >>> >>> Here are comparable metrics for two successive tasks: >>> *Slow*:

Re: Lookup / Access of master data in spark streaming

2015-10-06 Thread Olivier Girardot
That's great! Thanks! So to sum up: to do some kind of "always up-to-date" lookup we can use broadcast variables and re-broadcast when the data has changed, using either the "transform" RDD-to-RDD transformation, "foreachRDD", or transformWith. Thank you for your time. Regards, 2015-10-05
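A sketch of that re-broadcast pattern, assuming a StreamingContext ssc, an input stream, and hypothetical loadLookup()/lookupChanged() helpers:

  import org.apache.spark.broadcast.Broadcast

  var lookup: Broadcast[Map[String, String]] = ssc.sparkContext.broadcast(loadLookup())
  val enriched = stream.transform { rdd =>
    if (lookupChanged()) {  // hypothetical change detection; runs on the driver each batch
      lookup.unpersist()    // drop the stale copy from the executors
      lookup = rdd.sparkContext.broadcast(loadLookup())
    }
    rdd.map(key => (key, lookup.value.getOrElse(key, "unknown")))
  }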

Re: Spark SQL with Hive error: "Conf non-local session path expected to be non-null;"

2015-10-06 Thread er.jayants...@gmail.com
I have recently upgraded from spark 1.2 to spark 1.3. After the upgrade I made the necessary changes to incorporate DataFrames instead of JavaSchemaRDD. Now I am getting this error message *("org.apache.spark.sql.AnalysisException: Conf non-local session path expected to be non-null")* while running my

Re: 1.5 Build Errors

2015-10-06 Thread Benjamin Zaitlen
Hi All, Sean patiently worked with me in solving this issue. The problem was entirely my fault: the MAVEN_OPTS env variable was set and was overriding everything. --Ben On Tue, Sep 8, 2015 at 1:37 PM, Benjamin Zaitlen wrote: > Yes, just reran with the following > >

Help with big data operation performance

2015-10-06 Thread Saif.A.Ellafi
Hi all, In a stand-alone cluster operation, with more than 80 GB of RAM in each node, I am trying to: 1. load a partitioned json dataframe which weighs around 100GB as input 2. apply transformations such as casting some column types 3. get some percentiles, which involves sort by,

Does feature parity exist between Scala and Python on Spark

2015-10-06 Thread dant
Hi, I'm hearing a common theme that I should only do serious programming in Scala on Spark (1.5.1). Real power users use Scala. It is said that Python is great for analytics, but in the end the code should be rewritten in Scala to finalise. There are a number of reasons I'm hearing: 1. Spark

Does feature parity exist between Spark and PySpark

2015-10-06 Thread dant
Hi, I'm hearing a common theme that I should only do serious programming in Scala on Spark (1.5.1). Real power users use Scala. It is said that Python is great for analytics, but in the end the code should be rewritten in Scala to finalise. There are a number of reasons I'm hearing: 1. Spark

Re: laziness in textFile reading from HDFS?

2015-10-06 Thread Matt Narrell
One. I read in LZO compressed files from HDFS, perform a map operation, cache the results of this map operation, and call saveAsHadoopFile to write LZO back to HDFS. Without the cache, the job will stall. mn > On Oct 5, 2015, at 7:25 PM, Mohammed Guller wrote: > > Is there

Re: GraphX: How can I tell if 2 nodes are connected?

2015-10-06 Thread Dino Fancellu
Ok, thanks, just wanted to make sure I wasn't missing something obvious. I've worked with Neo4j cypher as well, where it was rather more obvious. e.g. http://neo4j.com/docs/milestone/query-match.html#_shortest_path http://neo4j.com/docs/stable/cypher-refcard/ Dino. On 6 October 2015 at 06:43,

Re: StructType has more rows, than corresponding Row has objects.

2015-10-06 Thread Eugene Morozov
Davies, that seemed to be my issue; my colleague helped me resolve it. The problem was that we build the RDD and the corresponding StructType ourselves (no json, parquet, cassandra, etc - we take a list of business objects and convert them to Rows, then infer the struct type) and I missed one thing.

Re: How can I disable logging when running local[*]?

2015-10-06 Thread Alex Kozlov
Try JAVA_OPTS='-Dlog4j.configuration=file:/' Internally, this is just spark.driver.extraJavaOptions, which you should be able to set in conf/spark-defaults.conf Can you provide more details how you invoke the driver? On Tue, Oct 6, 2015 at 9:48 AM, Jeff Jones
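A sketch of the spark-defaults.conf route mentioned above (the log4j.properties path is hypothetical):

  # conf/spark-defaults.conf
  spark.driver.extraJavaOptions  -Dlog4j.configuration=file:/path/to/log4j.properties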

Re: compatibility issue with Jersey2

2015-10-06 Thread Marcelo Vanzin
On Tue, Oct 6, 2015 at 5:57 AM, oggie wrote: > We have a Java app written with spark 1.3.1. That app also uses Jersey 2.9 > client to make external calls. We see spark 1.4.1 uses Jersey 1.9. How is this app deployed? If it's run via spark-submit, you could use

Re: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-06 Thread Jeff Nadler
>> We recently migrated our streaming jobs to the direct kafka receiver. >>> Our initial migration went quite fine but now we are seeing a weird zig-zag >>> performance pattern we cannot explain. >>> In alternating fashion, one task takes about 1 second to finish and

Re: [Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2015-10-06 Thread Alex Rovner
Thank you all for your help. *Alex Rovner* *Director, Data Engineering* *o:* 646.759.0052 On Tue, Oct 6, 2015 at 11:17 AM, Steve Loughran wrote: > > On 6 Oct 2015, at 01:23, Andrew Or wrote: > > Both the history

Re: Does feature parity exist between Spark and PySpark

2015-10-06 Thread Richard Eggert
Since the Python API is built on top of the Scala implementation, its performance can be at best roughly the same as that of the Scala API (as in the case of DataFrames and SQL) and at worst several orders of magnitude slower. Likewise, since the Scala implementation of new features

Re: Does feature parity exist between Spark and PySpark

2015-10-06 Thread Richard Eggert
That should have read "a lot of neat tricks", not "a lot of nest tricks". That's what I get for sending emails on my phone On Oct 6, 2015 8:32 PM, "Richard Eggert" wrote: > Since the Python API is built on top of the Scala implementation, its > performance can be

Re: Does feature parity exist between Scala and Python on Spark

2015-10-06 Thread DW @ Gmail
While I have a preference for Scala ( not surprising as a Typesafe person), the DataFrame API gives feature and performance parity for Python. The RDD API gives feature parity. So, use what makes you most successful for other reasons ;) Sent from my rotary phone. > On Oct 6, 2015, at 4:14

Re: Does feature parity exist between Spark and PySpark

2015-10-06 Thread Don Drake
If you are using Dataframes in PySpark, then the performance will be the same as Scala. However, if you need to implement your own UDF, or run a map() against a DataFrame in Python, then you will pay the penalty for performance when executing those functions since all of your data has to go

unresolved dependency: org.apache.spark#spark-streaming_2.10;1.5.0: not found

2015-10-06 Thread shahab
Hi, I am trying to use Spark 1.5 and MLlib, but I keep getting "sbt.ResolveException: unresolved dependency: org.apache.spark#spark-streaming_2.10;1.5.0: not found". It is weird that this happens, but I could not find any solution for this. Has anyone faced the same issue? best, /Shahab Here
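For reference, a minimal build.sbt fragment with the coordinates from the error message; whether it resolves will still depend on your resolvers and network:

  scalaVersion := "2.10.4"

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"      % "1.5.0" % "provided",
    "org.apache.spark" %% "spark-streaming" % "1.5.0" % "provided",
    "org.apache.spark" %% "spark-mllib"     % "1.5.0" % "provided"
  )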

Can KafkaCluster be public?

2015-10-06 Thread Erwan ALLAIN
Hello, I'm currently testing spark streaming with kafka. I'm creating DirectStream with KafkaUtils and everything's fine. However I would like to use the signature where I can specify my own message handler (to play with partition and offset). In this case, I need to manage offset/partition by
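A sketch of the message-handler variant being referred to (Spark 1.x direct stream; ssc, kafkaParams, the topic name, and the starting offsets are assumptions):

  import kafka.common.TopicAndPartition
  import kafka.message.MessageAndMetadata
  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.KafkaUtils

  val fromOffsets = Map(TopicAndPartition("events", 0) -> 0L)  // caller-managed offsets
  val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (Long, String)](
    ssc, kafkaParams, fromOffsets,
    // the handler sees partition/offset metadata for every record:
    (mmd: MessageAndMetadata[String, String]) => (mmd.offset, mmd.message()))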

GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-06 Thread Ophir Cohen
Hi Guys, I'm upgrading to Spark 1.5. In our previous version (Spark 1.3, but it was OK on 1.4 as well) we created a GenericMutableRow (org.apache.spark.sql.catalyst.expressions.GenericMutableRow) and returned it as org.apache.spark.sql.Row. Starting from Spark 1.5, GenericMutableRow doesn't extend Row.

Re: Spark thrift service and Hive impersonation.

2015-10-06 Thread Steve Loughran
On 5 Oct 2015, at 22:51, Jagat Singh wrote: Hello Steve, Thanks for confirmation. Is there any work planned work on this. Not that I'm aware of, though somebody may be doing it. SparkSQL is not Hive. It uses some of the libraries -the

RE: laziness in textFile reading from HDFS?

2015-10-06 Thread Mohammed Guller
I have not used LZO compressed files from Spark, so not sure why it stalls without caching. In general, if you are going to make just one pass over the data, there is not much benefit in caching it. The data gets read anyway only after the first action is called. If you are calling just a map
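A small sketch of that point about caching paying off only on a second pass (paths are placeholders):

  val data = sc.textFile("hdfs:///path/in")
  val mapped = data.map(_.toUpperCase).cache()  // cache only pays off if mapped is reused
  mapped.saveAsTextFile("hdfs:///path/out1")    // 1st action: reads HDFS, fills the cache
  mapped.saveAsTextFile("hdfs:///path/out2")    // 2nd action: served from the cache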

Re: laziness in textFile reading from HDFS?

2015-10-06 Thread Matt Narrell
Agreed. This is Spark 1.2 on CDH 5.x. How do you mitigate when the data sets are larger than available memory? My jobs stall, with gc/heap issues all over the place. ..via mobile > On Oct 6, 2015, at 4:44 PM, Mohammed Guller wrote: > > I have not used LZO compressed

Re: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-06 Thread Cody Koeninger
it's >>>> consistent between the two situations. >>>> >>>> On Tue, Oct 6, 2015 at 11:45 AM, Gerard Maas <gerard.m...@gmail.com> >>>> wrote: >>>> >>>>> Hi, >>>>> >>>>> We recently migrated our s

Re: RDD of ImmutableList

2015-10-06 Thread Jonathan Coveney
Nobody is saying not to use immutable data structures, only that Guava's aren't natively supported. Scala's default collections library is all immutable: List, Vector, Map. This is what people generally use, especially in Scala code! On Tuesday, October 6, 2015, Jakub Dubovsky <

Re: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-06 Thread Tathagata Das
get some metrics on the time taken by >>>>> your code on the executors (e.g. when processing the iterator) to see if >>>>> it's consistent between the two situations. >>>>> >>>>> On Tue, Oct 6, 2015 at 11:45 AM, Gerard Maas <gerard

Re: Can KafkaCluster be public?

2015-10-06 Thread Tathagata Das
Given the interest, I am also inclining towards making it a public developer API. Maybe even experimental. Cody, mind submitting a patch? On Tue, Oct 6, 2015 at 7:45 AM, Sean Owen wrote: > For what it's worth, I also use this class in an app, but it happens > to be from

RE: laziness in textFile reading from HDFS?

2015-10-06 Thread Mohammed Guller
It is not uncommon to process datasets larger than available memory with Spark. I don't remember whether LZO files are splittable. Perhaps, in your case Spark is running into issues while decompressing a large LZO file. See if this helps:

Re: Can KafkaCluster be public?

2015-10-06 Thread Cody Koeninger
Sure no prob. On Tue, Oct 6, 2015 at 6:35 PM, Tathagata Das wrote: > Given the interest, I am also inclining towards making it a public > developer API. Maybe even experimental. Cody, mind submitting a patch? > > > On Tue, Oct 6, 2015 at 7:45 AM, Sean Owen

Re: laziness in textFile reading from HDFS?

2015-10-06 Thread Jonathan Coveney
LZO files are not splittable by default but there are projects with Input and Output formats to make splittable LZO files. Check out twitter's elephantbird on GitHub. On Wednesday, October 7, 2015, Mohammed Guller wrote: > It is not uncommon to process datasets

Re: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-06 Thread Tathagata Das
2015 at 11:45 AM, Gerard Maas <gerard.m...@gmail.com> >>> wrote: >>> >>>> Hi, >>>> >>>> We recently migrated our streaming jobs to the direct kafka receiver. >>>> Our initial migration went quite fine but now we are seeing a weird

Re: Does feature parity exist between Spark and PySpark

2015-10-06 Thread ayan guha
Hi, 2 cents: 1. It should not be true anymore if data frames are used. The reason is that, regardless of the language, DF uses the same optimization engine behind the scenes. 2. This is generally true in the sense that Python APIs are typically a little behind the Scala/Java ones. Best, Ayan On Wed, Oct 7, 2015 at

RE: DStream Transformation to save JSON in Cassandra 2.1

2015-10-06 Thread Prateek .
Thanks Ashish and Jean. I am using the Scala API, so I used a case class: case class Coordinate(id: String, ax: Double, ay: Double, az: Double, oa: Double, ob: Double, og: Double) def mapToCoordinate(jsonMap: Map[String,Any]): Coordinate = { //map the coordinate } val lines =

RE: Graphx hangs and crashes on EdgeRDD creation

2015-10-06 Thread William Saar
Hi, I get the same problem with both the CanonicalVertexCut and RandomVertexCut, with the graph code as follows val graph = Graph.fromEdgeTuples(indexedEdges, 0, None, StorageLevel.MEMORY_AND_DISK_SER, StorageLevel.MEMORY_AND_DISK_SER); graph.partitionBy(PartitionStrategy.RandomVertexCut);

compatibility issue with Jersey2

2015-10-06 Thread oggie
I ran into some Jersey compatibility issues when I tried to upgrade from 1.3.1 to 1.4.1. We have a Java app written with Spark 1.3.1. That app also uses the Jersey 2.9 client to make external calls. We see Spark 1.4.1 uses Jersey 1.9. In 1.3.1 we were able to add some exclusions to our pom and

Re: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-06 Thread akhandeshi
I couldn't get this working... I have JAVA_HOME set. I have defined SPARK_HOME Sys.setenv(SPARK_HOME="c:\DevTools\spark-1.5.1") .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) library("SparkR", lib.loc="c:\\DevTools\\spark-1.5.1\\lib") library(SparkR)

Re: Broadcast var is null

2015-10-06 Thread dpristin
I've reduced the code to the code below - no streaming, no Kafka, no checkpoint. Unfortunately the end result is the same - "broadcastVar is null" printed in the worker log. Any suggestion on what I'm missing would be very much appreciated ! object BroadcastTest extends App { val logger =

Re: Can KafkaCluster be public?

2015-10-06 Thread Jonathan Coveney
You can put a class in the org.apache.spark namespace to access anything that is private[spark]. You can then make enrichments there to access whatever you need. Just beware upgrade pain :) On Tuesday, October 6, 2015, Erwan ALLAIN wrote: > Hello, > > I'm
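A sketch of that trick, with a hypothetical accessor object compiled into your own jar (package internals can move between Spark versions, hence the upgrade pain):

  // Declared inside Spark's package so private[spark] members become visible:
  package org.apache.spark.streaming.kafka

  object KafkaClusterAccessor {
    def apply(kafkaParams: Map[String, String]): KafkaCluster =
      new KafkaCluster(kafkaParams)
  }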

Re: [Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2015-10-06 Thread Andreas Fritzler
Hi Andrew, thanks a lot for the clarification! Regards, Andreas On Tue, Oct 6, 2015 at 2:23 AM, Andrew Or wrote: > Hi all, > > Both the history server and the shuffle service are backward compatible, > but not forward compatible. This means as long as you have the

Re: compatibility issue with Jersey2

2015-10-06 Thread Ted Yu
Maybe build Spark with -Djersey.version=2.9 ? Cheers On Tue, Oct 6, 2015 at 5:57 AM, oggie wrote: > I have some jersey compatibility issues when I tried to upgrade from 1.3.1 > to > 1.4.1.. > > We have a Java app written with spark 1.3.1. That app also uses Jersey 2.9 >

Re: extracting the top 100 values from an rdd and save it as text file

2015-10-06 Thread gtanguy
Hello patelmiteshn, This could do the trick: rdd1 = rdd.sortBy(lambda x: x[1], ascending=False) rdd2 = rdd1.zipWithIndex().filter(lambda t: t[1] < 100).map(lambda t: t[0]) rdd2.saveAsTextFile()

Enabling kryo serialization slows down machine learning app.

2015-10-06 Thread fede.sc
Hi, my team is setting up a machine-learning framework based on Spark's mlib, that currently uses logistic regression. I enabled Kryo serialization and enforced class registration, so I know that all the serialized classes are registered. However, the running times when Kryo serialization is
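For reference, a minimal sketch of the setup being described; the registered classes are illustrative, not the poster's actual ones:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrationRequired", "true")  // fail fast on unregistered classes
    .registerKryoClasses(Array(
      classOf[org.apache.spark.mllib.linalg.DenseVector],
      classOf[Array[Double]]))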

RE: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-06 Thread Sun, Rui
What you have done is supposed to work. Need more debugging information to find the cause. Could you add the following lines before calling sparkR.init()? Sys.setenv(SPARKR_SUBMIT_ARGS="--verbose sparkr-shell") Sys.setenv(SPARK_PRINT_LAUNCH_COMMAND=1) Then to see if you can find any hint in

Trying PCA on Spark but a serialization error is thrown

2015-10-06 Thread Cukoo
Hi, I tried to use the PCA object in one of my projects but ended up receiving a serialization error. Any help would be appreciated. The example is taken from https://spark.apache.org/docs/latest/mllib-feature-extraction.html#pca My Code: val selector = new PCA(20) val transformer =
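For reference, a self-contained variant of that docs example (k and the data are toy values; the serialization error itself will depend on the surrounding closure):

  import org.apache.spark.mllib.feature.PCA
  import org.apache.spark.mllib.linalg.Vectors

  val vectors = sc.parallelize(Seq(
    Vectors.dense(1.0, 2.0, 3.0),
    Vectors.dense(4.0, 5.0, 6.0),
    Vectors.dense(7.0, 8.0, 10.0)))
  val pca = new PCA(2).fit(vectors)           // keep k = 2 principal components
  val projected = vectors.map(pca.transform)  // project each vector into PCA space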

Re: How can I disable logging when running local[*]?

2015-10-06 Thread Jeff Jones
Thanks. Any chance you know how to pass this to a Scala app that is run via the Typesafe activator? I tried putting it in $JAVA_OPTS but I get: Unrecognized option: --driver-java-options Error: Could not create the Java Virtual Machine. Error: A fatal exception has occurred. Program will exit. I

Re: ORC files created by Spark job can't be accessed using hive table

2015-10-06 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTtwwjNxXvPEe1 A brief search in Spark JIRAs didn't find anything opened on this subject. On Tue, Oct 6, 2015 at 8:51 AM, unk1102 wrote: > Hi I have a spark job which creates ORC files in partitions using the > following code

API to run spark Jobs

2015-10-06 Thread shahid qadri
Hi Folks, How can I submit my Spark app (Python) to the cluster without using spark-submit? Actually I need to invoke jobs from a UI.

Re: Spark 1.3.1 on Yarn not using all given capacity

2015-10-06 Thread Cesar Berezowski
3 cores*, not 8. César. > On 6 Oct 2015 at 19:08, Cesar Berezowski wrote: > > I deployed HDP 2.3.1 and got Spark 1.3.1; Spark 1.4 is supposed to be > available as a technical preview, I think > > vendor's forum? you mean Hortonworks'? > > -- > Update on my info: > >

Re: ORC files created by Spark job can't be accessed using hive table

2015-10-06 Thread Michael Armbrust
I believe this is fixed in Spark 1.5.1 as long as the table is only using types that Hive understands and is not partitioned. The problem with partitioned tables is that Hive does not support dynamic discovery unless you manually run the repair command. On Tue, Oct 6, 2015 at 9:33 AM, Umesh
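The repair command referred to can be run from the Hive CLI, or, as a sketch, through a HiveContext (the table name is taken from this thread; support depends on the bundled Hive version):

  sqlContext.sql("MSCK REPAIR TABLE baseTable")  // assuming sqlContext is a HiveContext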

Re: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-06 Thread Hossein
Have you built the Spark jars? Can you run the Spark Scala shell? --Hossein On Tuesday, October 6, 2015, Khandeshi, Ami wrote: > > Sys.setenv(SPARKR_SUBMIT_ARGS="--verbose sparkr-shell") > > Sys.setenv(SPARK_PRINT_LAUNCH_COMMAND=1) > > > > sc <-

Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-06 Thread Gerard Maas
streaming rate. Here are comparable metrics for two successive tasks: *Slow*:
Executor ID | Address | Task Time | Total Tasks | Failed Tasks | Succeeded Tasks
20151006-044141-2408867082-5050-21047-S0 | dnode-3.hdfs.private:36863 | 22 s | 3 | 0 | 3
20151006-044141-2408867082-5050-21047-S1 | dnode-0.hdfs.private:43812 | 40 s | 11 | 0 | 11

Re: API to run spark Jobs

2015-10-06 Thread Ted Yu
Please take a look at: org.apache.spark.deploy.rest.RestSubmissionClient which is used by core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala FYI On Tue, Oct 6, 2015 at 10:08 AM, shahid qadri wrote: > hi Jeff > Thanks > More specifically i need the Rest api

Re: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-06 Thread akhandeshi
It seems it is failing at path <- tempfile(pattern = "backend_port") I do not see the backend_port directory created...

Spark 1.3.1 on Yarn not using all given capacity

2015-10-06 Thread czoo
Hi, This post might be a duplicate with updates from another one (by me), sorry in advance. I have an HDP 2.3 cluster running Spark 1.3.1 on 6 nodes (edge + master + 4 workers). Each worker has 8 cores and 40G of RAM available in Yarn. That makes a total of 160GB and 32 cores. I'm running a job

Re: ORC files created by Spark job can't be accessed using hive table

2015-10-06 Thread Umesh Kacha
Hi Ted, thanks, I know; I solved that by using a dataframe for both reading and writing. I am running into a different problem now: if Spark can read Hive ORC files, why can't Hive read ORC files created by Spark? On Oct 6, 2015 9:28 PM, "Ted Yu" wrote: > See this thread: >

Re: API to run spark Jobs

2015-10-06 Thread Jeff Nadler
Spark standalone doesn't come with a UI for submitting jobs. Some Hadoop distros might; for example, EMR in AWS has a job submit UI. Spark submit just calls a REST API; you could build any UI you want on top of that... On Tue, Oct 6, 2015 at 9:37 AM, shahid qadri

Re: Spark 1.3.1 on Yarn not using all given capacity

2015-10-06 Thread Ted Yu
Consider posting the question on the vendor's forum. HDP 2.3 comes with Spark 1.4, if I remember correctly. On Tue, Oct 6, 2015 at 9:05 AM, czoo wrote: > Hi, > > This post might be a duplicate with updates from another one (by me), sorry > in advance > > I have an HDP 2.3

Re: API to run spark Jobs

2015-10-06 Thread shahid qadri
Hi Jeff, Thanks. More specifically I need the REST API to submit a pyspark job. Can you point me to the Spark submit REST API? > On Oct 6, 2015, at 10:25 PM, Jeff Nadler wrote: > > > Spark standalone doesn't come with a UI for submitting jobs. Some Hadoop > distros might, for

Re: API to run spark Jobs

2015-10-06 Thread Jeff Nadler
Yeah I was going to suggest looking at the code too. It's a shame there isn't a page in the docs that covers the port 6066 rest api. On Tue, Oct 6, 2015 at 10:16 AM, Ted Yu wrote: > Please take a look at: > org.apache.spark.deploy.rest.RestSubmissionClient > > which is

Re: spark-submit --packages using different resolver

2015-10-06 Thread Jerry Lam
Hi Burak, Thank you for the tip. Unfortunately it does not work. It throws: java.net.MalformedURLException: unknown protocol: s3n] at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1003) at

ORC files created by Spark job can't be accessed using hive table

2015-10-06 Thread unk1102
Hi, I have a spark job which creates ORC files in partitions using the following code: dataFrame.write().mode(SaveMode.Append).partitionBy("entity","date").format("orc").save("baseTable"); The above code successfully creates ORC files, which are readable in a Spark dataframe. But when I try to load the ORC