Re: spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

2015-04-23 Thread Sujee Maniyam
Thanks all... btw, s3n load works without any issues with spark-1.3.1-built-for-hadoop 2.4. I tried this on 1.3.1-hadoop26: sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem") val f = sc.textFile("s3n://bucket/file") f.count No it can't find the

Re: Instantiating/starting Spark jobs programmatically

2015-04-23 Thread Anshul Singhle
Hi firemonk9, What you're doing looks interesting. Can you share some more details? Are you running the same spark context for each job, or are you running a separate spark context for each job? Does your system need sharing of RDDs across multiple jobs? If yes, how do you implement that? Also

RE: Map Question

2015-04-23 Thread Ganelin, Ilya
You need to expose that variable the same way you'd expose any other variable in Python that you wanted to see across modules. As long as you share a spark context all will work as expected. http://stackoverflow.com/questions/142545/python-how-to-make-a-cross-module-variable

Spark + Hue

2015-04-23 Thread MrAsanjar .
Hi all Is there any good documentation on how to integrate spark with Hue 3.7.x? Is the only way to install spark Job Server? Thanks in advance for your help

Re: Trouble working with Spark-CSV package (error: object databricks is not a member of package com)

2015-04-23 Thread Mohammed Omer
Hm, no I don't have that in my path. However, someone on the spark-csv project advised that since I could not get another package/example to work, that this might be a Spark / Yarn issue: https://github.com/databricks/spark-csv/issues/54 Thoughts? I'll open a ticket later this afternoon if the

Re: A Spark Group by is running forever

2015-04-23 Thread ๏̯͡๏
I have seen multiple blogs stating to use reduceByKey instead of groupByKey. Could someone please help me convert the below code to use reduceByKey? Code: ... some spark processing ... val viEventsWithListingsJoinSpsLevelMetric:
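For reference, the standard rewrite looks like the following, sketched in Scala on a hypothetical pairs RDD whose values are simply summed (the actual code above is truncated, so the aggregation is assumed):

    // groupByKey ships every value across the shuffle before combining
    val slow = pairs.groupByKey().mapValues(_.sum)
    // reduceByKey combines map-side first, so far less data is shuffled
    val fast = pairs.reduceByKey(_ + _)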

dynamicAllocation spark-shell

2015-04-23 Thread Michael Stone
If I enable dynamicAllocation and then use spark-shell or pyspark, things start out working as expected: running simple commands causes new executors to start and complete tasks. If the shell is left idle for a while, executors start getting killed off: 15/04/23 10:52:43 INFO
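For context, the idle-kill behavior described above is governed by the dynamic allocation settings. A sketch of the relevant spark-defaults.conf keys in Spark 1.3 (values illustrative; executors idle longer than the timeout are deliberately released):

    spark.dynamicAllocation.enabled true
    spark.dynamicAllocation.minExecutors 1
    spark.dynamicAllocation.maxExecutors 20
    # idle executors are removed after this many seconds
    spark.dynamicAllocation.executorIdleTimeout 600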

Re: spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

2015-04-23 Thread Ted Yu
NativeS3FileSystem class is in the hadoop-aws jar. Looks like it was not on the classpath. Cheers On Thu, Apr 23, 2015 at 7:30 AM, Sujee Maniyam su...@sujee.net wrote: Thanks all... btw, s3n load works without any issues with spark-1.3.1-built-for-hadoop 2.4 I tried this on 1.3.1-hadoop26
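A minimal sketch of the resulting workaround, assuming the hadoop-aws jar has been put on the classpath (e.g. via --jars) and reusing the placeholder bucket/file from the question:

    sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    val f = sc.textFile("s3n://bucket/file") // placeholder URL
    f.count()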

Re: Instantiating/starting Spark jobs programmatically

2015-04-23 Thread Dean Wampler
I strongly recommend spawning a new process for the Spark jobs. Much cleaner separation. Your driver program won't be clobbered if the Spark job dies, etc. It can even watch for failures and restart. In the Scala standard library, the sys.process package has classes for constructing and
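A minimal Scala sketch of that approach; the jar path and class name are hypothetical:

    import scala.sys.process._

    // Launch spark-submit as a child process; .! blocks and returns the exit code,
    // so the driver program survives (and can retry) if the Spark job dies.
    val exitCode = Seq("spark-submit", "--class", "com.example.MyJob", "/path/to/myjob.jar").!
    if (exitCode != 0) println(s"Spark job failed with exit code $exitCode")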

Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
Using Spark streaming to create a large volume of small nano-batch input files, ~4k per file, thousands of ‘part-x’ files. When reading the nano-batch files and doing a distributed calculation my tasks run only on the machine where it was launched. I’m launching in “yarn-client” mode. The

[Spark Streaming] Help with updateStateByKey()

2015-04-23 Thread allonsy
Hi everybody, I think I could use some help with the updateStateByKey() Java method in Spark Streaming. Context: I have a JavaReceiverInputDStream<DataUpdate> du DStream, where object DataUpdate mainly has 2 fields of interest (in my case), namely du.personId (an Integer) and
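For comparison, a minimal sketch of updateStateByKey in Scala (the Java API mirrors it with a Function2); the running-count-per-person logic is an assumption, and a checkpoint directory must be set via ssc.checkpoint(...):

    // keep a running count per personId; returning None would drop the key's state
    val counts = du.map(d => (d.personId, 1))
      .updateStateByKey[Int]((values: Seq[Int], state: Option[Int]) =>
        Some(state.getOrElse(0) + values.sum))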

Re: Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
Sure var columns = mc.textFile(source).map { line => line.split(delimiter) } Here “source” is a comma-delimited list of files or directories. Both the textFile and .map tasks happen only on the machine they were launched from. Later other distributed operations happen but I suspect if I can

Re: Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
Argh, I looked and there really isn’t that much data yet. There will be thousands but starting small. I bet this is just a total data size not requiring all workers thing—sorry, nevermind. On Apr 23, 2015, at 10:30 AM, Pat Ferrel p...@occamsmachete.com wrote: They are in HDFS so available on

Slower performance when bigger memory?

2015-04-23 Thread Shuai Zheng
Hi All, I am running some benchmarks on an r3.8xlarge instance. I have a cluster with one master (no executor on it) and one slave (r3.8xlarge). My job has 1000 tasks in stage 0. An r3.8xlarge has 244G memory and 32 cores. If I create 4 executors, each has 8 cores + 50G memory, each task

Re: Shuffle files not cleaned up (Spark 1.2.1)

2015-04-23 Thread N B
Thanks for the response, Conor. I tried with those settings and for a while it seemed like it was cleaning up shuffle files after itself. However, exactly 5 hours later it started throwing exceptions and eventually stopped working again. A sample stack trace is below. What is curious about 5

Re: Bug? Can't reference to the column by name after join two DataFrame on a same name key

2015-04-23 Thread Yin Huai
Hi Shuai, You can use as to create a table alias. For example, df1.as("df1"). Then you can use $"df1.col" to refer to it. Thanks, Yin On Thu, Apr 23, 2015 at 11:14 AM, Shuai Zheng szheng.c...@gmail.com wrote: Hi All, I use 1.3.1 When I have two DF and join them on a same name key, after
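A minimal sketch of that, assuming Spark 1.3 and import sqlContext.implicits._ for the $ syntax (column names illustrative):

    val joined = df1.as("df1").join(df2.as("df2"), $"df1.col1" === $"df2.col1")
    joined.select($"df1.col1") // the alias disambiguates the shared column name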

Re: Tasks run only on one machine

2015-04-23 Thread Sean Owen
Where are the file splits? Meaning, is it possible they were also (only) available on one node, and that was also your driver? On Thu, Apr 23, 2015 at 1:21 PM, Pat Ferrel p...@occamsmachete.com wrote: Sure var columns = mc.textFile(source).map { line => line.split(delimiter) } Here “source”
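If the input really does have too few splits, a common remedy (a sketch, not necessarily the fix here; the partition count is illustrative) is to ask for more partitions when reading, or to repartition afterwards:

    // request more input splits up front...
    val columns = mc.textFile(source, minPartitions = 16).map(_.split(delimiter))
    // ...or redistribute explicitly after reading
    val spread = columns.repartition(16)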

Re: Slower performance when bigger memory?

2015-04-23 Thread Ted Yu
Shuai: Please take a look at: http://blog.takipi.com/garbage-collectors-serial-vs-parallel-vs-cms-vs-the-g1-and-whats-new-in-java-8/ On Apr 23, 2015, at 10:18 AM, Dean Wampler deanwamp...@gmail.com wrote: JVM's often have significant GC overhead with heaps bigger than 64GB. You might try
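One way to experiment with the G1 collector discussed in that article, sketched as submit-time options (the flags are standard JVM options and yourapp.jar is a placeholder):

    spark-submit --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails" yourapp.jar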

Re: Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
Physically? Not sure, they were written using the nano-batch RDDs in a streaming job that is in a separate driver. The job is a Kafka consumer. Would that affect all derived RDDs? If so is there something I can do to mix it up, or does Spark know best about execution speed here? On Apr 23,

Re: Map Question

2015-04-23 Thread Vadim Bichutskiy
Here it is. How do I access a broadcastVar in a function that's in another module (process_stuff.py below): Thanks, Vadim main.py --- from pyspark import SparkContext, SparkConf from pyspark.streaming import StreamingContext from pyspark.sql import SQLContext from process_stuff import

Re: Map Question

2015-04-23 Thread Vadim Bichutskiy
Thanks Ilya. I am having trouble doing that. Can you give me an example? On Thu, Apr 23, 2015 at 12:06 PM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: You need to expose that variable the same way you'd expose any other variable in Python that you wanted to see across modules. As long

Re: Tasks run only on one machine

2015-04-23 Thread Jeetendra Gangele
Will you be able to paste code here? On 23 April 2015 at 22:21, Pat Ferrel p...@occamsmachete.com wrote: Using Spark streaming to create a large volume of small nano-batch input files, ~4k per file, thousands of 'part-x' files. When reading the nano-batch files and doing a distributed

Re: dynamicAllocation spark-shell

2015-04-23 Thread Cheolsoo Park
Hi, Attempted to request a negative number of executor(s) -663 from the cluster manager. Please specify a positive number! This is a bug in dynamic allocation. Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-6954 Thanks! Cheolsoo On Thu, Apr 23, 2015 at 7:57 AM, Michael Stone

Re: Shuffle files not cleaned up (Spark 1.2.1)

2015-04-23 Thread Tathagata Das
What was the state of your streaming application? Was it falling behind with a large increasing scheduling delay? TD On Thu, Apr 23, 2015 at 11:31 AM, N B nb.nos...@gmail.com wrote: Thanks for the response, Conor. I tried with those settings and for a while it seemed like it was cleaning up

Non-Deterministic Graph Building

2015-04-23 Thread hokiegeek2
Hi Everyone, I am running into a really weird problem that only one other person has reported to the best of my knowledge (and the thread never yielded a resolution). I build a GraphX Graph from an input EdgeRDD and VertexRDD via the Graph(VertexRDD,EdgeRDD) constructor. When I execute

Question regarding join with multiple columns with pyspark

2015-04-23 Thread Ali Bajwa
Hi experts, Sorry if this is a n00b question or has already been answered... Am trying to use the data frames API in python to join 2 dataframes with more than 1 column. The example I've seen in the documentation only shows a single column - so I tried this: Example code import pandas

RE: Bug? Can't reference to the column by name after join two DataFrame on a same name key

2015-04-23 Thread Shuai Zheng
Got it. Thanks! :) From: Yin Huai [mailto:yh...@databricks.com] Sent: Thursday, April 23, 2015 2:35 PM To: Shuai Zheng Cc: user Subject: Re: Bug? Can't reference to the column by name after join two DataFrame on a same name key Hi Shuai, You can use as to create a table alias. For

Re: why does groupByKey return RDD[(K, Iterable[V])] not RDD[(K, CompactBuffer[V])] ?

2015-04-23 Thread Hao Ren
Should I repost this to the dev list?

Getting error running MLlib example with new cluster

2015-04-23 Thread Su She
I had asked this question before, but wanted to ask again as I think it is related to my pom file or project setup. I have been trying on/off for the past month to try to run this MLlib example:

Pyspark where do third parties libraries need to be installed under Yarn-client mode

2015-04-23 Thread dusts66
I am trying to figure out python library management. So my question is: where do third-party Python libraries (e.g. numpy, scipy, etc.) need to exist if I'm running a spark job via 'spark-submit' against my cluster in 'yarn client' mode? Do the libraries need to only exist on the client (i.e. the

Getting error running MLlib example with new cluster

2015-04-23 Thread Su She
Sorry, accidentally sent the last email before finishing. I had asked this question before, but wanted to ask again as I think it is now related to my pom file or project setup. Really appreciate the help! I have been trying on/off for the past month to try to run this MLlib example:

gridsearch - python

2015-04-23 Thread Pagliari, Roberto
Can anybody point me to an example, if available, about gridsearch with python? Thank you,

Re: problem writing to s3

2015-04-23 Thread Daniel Mahler
Hi Akhil I can confirm that the problem goes away when jsonRaw and jsonClean are in different s3 buckets. thanks Daniel On Thu, Apr 23, 2015 at 1:27 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you try writing to a different S3 bucket and confirm that? Thanks Best Regards On Thu,

Re: Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
They are in HDFS so available on all workers On Apr 23, 2015, at 10:29 AM, Pat Ferrel p...@occamsmachete.com wrote: Physically? Not sure, they were written using the nano-batch RDDs in a streaming job that is in a separate driver. The job is a Kafka consumer. Would that affect all derived

Bug? Can't reference to the column by name after join two DataFrame on a same name key

2015-04-23 Thread Shuai Zheng
Hi All, I use 1.3.1. When I have two DFs and join them on a same-name key, after that I can't get the common key by name. Basically: select * from t1 inner join t2 on t1.col1 = t2.col1. And I am using purely DataFrame, Spark SQLContext, not HiveContext. DataFrame df3 =

Is the Spark-1.3.1 support build with scala 2.8 ?

2015-04-23 Thread guoqing0...@yahoo.com.hk
Does Spark 1.3.1 support building with Scala 2.8? And if it is built with Scala 2.10, can it be integrated with kafka_2.8.0-0.8.0? Thanks.

Re: why does groupByKey return RDD[(K, Iterable[V])] not RDD[(K, CompactBuffer[V])] ?

2015-04-23 Thread Corey Nolet
If you return an iterable, you are not tying the API to a CompactBuffer. Someday, the data could be fetched lazily and the API would not have to change. On Apr 23, 2015 6:59 PM, Dean Wampler deanwamp...@gmail.com wrote: I wasn't involved in this decision (I just make the fries), but

Re: Understanding Spark/MLlib failures

2015-04-23 Thread Burak Yavuz
Hi Andrew, I observed similar behavior under high GC pressure when running ALS. What happened to me was that there would be very long full GC pauses (over 600 seconds at times). These would prevent the executors from sending heartbeats to the driver. Then the driver would think that the

Re: why does groupByKey return RDD[(K, Iterable[V])] not RDD[(K, CompactBuffer[V])] ?

2015-04-23 Thread Koert Kuipers
because CompactBuffer is considered an implementation detail. It is also not public for the same reason. On Thu, Apr 23, 2015 at 6:46 PM, Hao Ren inv...@gmail.com wrote: Should I repost this to dev list ? -- View this message in context:

Understanding Spark/MLlib failures

2015-04-23 Thread aleverentz
[My apologies if this is a re-post. I wasn't subscribed the first time I sent this message, and I'm hoping this second message will get through.] I’ve been using Spark 1.3.0 and MLlib for some machine learning tasks. In a fit of blind optimism, I decided to try running MLlib’s Principal

Re: Understanding Spark/MLlib failures

2015-04-23 Thread Reza Zadeh
Hi Andrew, The .principalComponents feature of RowMatrix is currently constrained to tall and skinny matrices. Your matrix is barely above the skinny requirement (10k columns), though the number of rows is fine. What are you looking to do with the principal components? If unnormalized PCA is OK

Re: why does groupByKey return RDD[(K, Iterable[V])] not RDD[(K, CompactBuffer[V])] ?

2015-04-23 Thread Dean Wampler
I wasn't involved in this decision (I just make the fries), but CompactBuffer is designed for relatively small data sets that at least fit in memory. It's more or less an Array. In principle, returning an iterator could hide the actual data structure that might be needed to hold a much bigger data

Re: setting cost in linear SVM [Python]

2015-04-23 Thread Xiangrui Meng
If by C you mean the parameter C in LIBLINEAR, the corresponding parameter in MLlib is regParam: https://github.com/apache/spark/blob/master/python/pyspark/mllib/classification.py#L273, while regParam = 1/C. -Xiangrui On Wed, Apr 22, 2015 at 3:25 PM, Pagliari, Roberto rpagli...@appcomsci.com

Re: Problem with using Spark ML

2015-04-23 Thread Staffan
So I got the tip of trying to reduce step-size, and that finally gave some more decent results. I had hoped for the default params to give at least OK results and thought that the problem must be somewhere else in the code. Problem solved!

Re: StackOverflow Error when run ALS with 100 iterations

2015-04-23 Thread Xiangrui Meng
ALS.setCheckpointInterval was added in Spark 1.3.1. You need to upgrade Spark to use this feature. -Xiangrui On Wed, Apr 22, 2015 at 9:03 PM, amghost zhengweita...@outlook.com wrote: Hi, would you please show how to checkpoint the training set RDD, since all things are done in the ALS.train method.
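A minimal sketch on Spark 1.3.1, assuming a ratings RDD of MLlib Rating objects; the checkpoint directory and hyperparameters are illustrative:

    import org.apache.spark.mllib.recommendation.ALS

    sc.setCheckpointDir("hdfs:///tmp/als-checkpoints") // required for checkpointing to happen
    val model = new ALS()
      .setRank(10)
      .setIterations(100)
      .setCheckpointInterval(10) // checkpoint every 10 iterations to keep the lineage short
      .run(ratings)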

Re: Distinct is very slow

2015-04-23 Thread Jeetendra Gangele
Anyone, any thoughts on this? On 22 April 2015 at 22:49, Jeetendra Gangele gangele...@gmail.com wrote: I made 7000 tasks in mapToPair and in distinct also I made the same number of tasks. Still lots of shuffle read and write is happening, due to which the application is running for a much longer time. Any idea?

RE: Error in creating spark RDD

2015-04-23 Thread Sun, Rui
Hi, SparkContext.newAPIHadoopRDD() is for working with the new Hadoop mapreduce API. So, you should import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat instead of org.apache.accumulo.core.client.mapred.AccumuloInputFormat. -Original Message- From: madhvi

Re: Spark RDD Lifecycle: whether RDD will be reclaimed out of scope

2015-04-23 Thread Prannoy
Hi, Yes, Spark automatically removes old RDDs from the cache when you make new ones. Unpersist forces it to remove them right away. On Thu, Apr 23, 2015 at 9:28 AM, Jeffery [via Apache Spark User List] ml-node+s1001560n22618...@n3.nabble.com wrote: Hi, Dear Spark Users/Devs: In a method, I
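A small sketch of forcing eager removal (path illustrative):

    val cached = sc.textFile("hdfs:///data/input").cache()
    cached.count()     // materializes the cache
    cached.unpersist() // evict now instead of waiting for automatic cleanup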

Pipeline in pyspark

2015-04-23 Thread Suraj Shetiya
Hi, I have come across ways of building pipelines of input/transform and output with Java (Google Dataflow/Spark etc). I also understand that Spark itself provides ways for creating a pipeline within MLlib for ML transforms (primarily fit). Both of the above are available in Java/Scala

Re: IOUtils cannot write anything in Spark?

2015-04-23 Thread Holden Karau
It seems like saveAsTextFile might do what you are looking for. On Wednesday, April 22, 2015, Xi Shen davidshe...@gmail.com wrote: Hi, I have a RDD of some processed data. I want to write these files to HDFS, but not for future M/R processing. I want to write plain old style text file. I
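A minimal sketch of that suggestion, with a hypothetical formatting function and output path:

    // format each record to its text form, then let Spark write the part files to HDFS
    val lines = rdd.map(d => formatRecord(d)) // formatRecord is a placeholder
    lines.saveAsTextFile("hdfs:///path/output")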

IOUtils cannot write anything in Spark?

2015-04-23 Thread Xi Shen
Hi, I have an RDD of some processed data. I want to write these files to HDFS, but not for future M/R processing. I want to write plain old-style text files. I tried: rdd foreach { d => val file = // create the file using a HDFS FileSystem val lines = d map { // format data into string }

Re: StandardScaler failing with OOM errors in PySpark

2015-04-23 Thread Rok Roskar
the feature dimension is 800k. yes, I believe the driver memory is likely the problem since it doesn't crash until the very last part of the tree aggregation. I'm running it via pyspark through YARN -- I have to run in client mode so I can't set spark.driver.memory -- I've tried setting the

spark yarn-cluster job failing in batch processing

2015-04-23 Thread sachin Singh
Hi All, I am trying to execute batch processing in yarn-cluster mode, i.e. I have many SQL insert queries; based on the argument provided it will fetch the queries, create a context and schema RDD, and insert into Hive tables. Please note - in standalone mode it's working, and in cluster mode working is

Is there a way to get the list of all jobs?

2015-04-23 Thread mkestemont
Hello, I am currently trying to monitor the progression of jobs. I created a class extending SparkListener, added a jobProgressListener to my sparkContext, and overrode the methods onTaskStart, onTaskEnd, onJobStart and onJobEnd, which leads to good results. Then, I would also like to monitor

Re: Pipeline in pyspark

2015-04-23 Thread ayan guha
I do not think you can share data across spark contexts. So as long as you can pass it around you should be good. On 23 Apr 2015 17:12, Suraj Shetiya surajshet...@gmail.com wrote: Hi, I have come across ways of building pipeline of input/transform and output pipelines with Java (Google

Re: Hive table creation - possible bug in Spark 1.3?

2015-04-23 Thread madhu phatak
Hi, Hive table creation needs an extra step from 1.3. You can follow the following template: df.registerTempTable(tableName) hc.sql(s"create table $tableName as select * from $tableName") this will save the table in Hive with the given tableName. Regards, Madhukara Phatak

Re: Spark SQL performance issue.

2015-04-23 Thread ayan guha
Quick questions: why are you caching both the RDD and the table? Which stage of the job is slow? On 23 Apr 2015 17:12, Nikolay Tikhonov tikhonovnico...@gmail.com wrote: Hi, I have a Spark SQL performance issue. My code contains a simple JavaBean: public class Person implements Externalizable {

Re: Spark SQL performance issue.

2015-04-23 Thread Nikolay Tikhonov
why are you caching both the RDD and the table? I try to cache all the data to avoid bad performance for the first query. Is that right? Which stage of the job is slow? The query is run many times on one sqlContext and each query execution takes 1 second. 2015-04-23 11:33 GMT+03:00 ayan guha

A Spark Group by is running forever

2015-04-23 Thread ๏̯͡๏
I have a groupBy query after a map-side join (leftOuterJoin), and this query is running for more than 2 hours. Tasks (from the web UI; columns are Index, ID, Attempt, Status, Locality Level, Executor ID / Host, Launch Time, Duration, GC Time, Shuffle Read Size / Records, Write Time, Shuffle Write Size / Records, Errors): 0 36 0 RUNNING PROCESS_LOCAL 17

Re: Custom partitioning of DStream

2015-04-23 Thread davidkl
Hello Evo, Ranjitiyer, I am also looking for the same thing. Using foreach is not useful for me, as processing the RDD as a whole won't be distributed across workers and that would kill performance in my application :-/ Let me know if you find a solution for this. Regards

RE: Custom partitioning of DStream

2015-04-23 Thread Evo Eftimov
You can use transform, which yields the RDDs of the DStream, and on each of those RDDs you can then apply partitionBy; transform also returns another DStream, while foreach doesn't. Btw, what do you mean re foreach killing the performance by not distributing the workload - every function (provided it
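A minimal sketch of that suggestion, assuming a DStream of key-value pairs (pairDStream and the partition count are illustrative):

    import org.apache.spark.HashPartitioner

    // re-partition every micro-batch RDD; unlike foreachRDD, transform returns a DStream
    val repartitioned = pairDStream.transform(rdd => rdd.partitionBy(new HashPartitioner(8)))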

Re: Hive table creation - possible bug in Spark 1.3?

2015-04-23 Thread madhu phatak
Hi Michael, Here https://issues.apache.org/jira/browse/SPARK-7084 is the jira issue and PR https://github.com/apache/spark/pull/5654 for the same. Please have a look. Regards, Madhukara Phatak http://datamantra.io/ On Thu, Apr 23, 2015 at 1:22 PM, madhu phatak phatak@gmail.com wrote:

Re: Convert DStream to DataFrame

2015-04-23 Thread Sergio Jiménez Barrio
Thank you very much, Tathagata! On Wednesday, April 22, 2015, Tathagata Das t...@databricks.com wrote: Aaah, that. That is probably a limitation of the SQLContext (cc'ing Yin for more information). On Wed, Apr 22, 2015 at 7:07 AM, Sergio Jiménez Barrio drarse.a...@gmail.com

Re: Spark SQL performance issue.

2015-04-23 Thread Arush Kharbanda
Hi, Can you share your Web UI, depicting your task-level breakup? I can see many things that can be improved. 1. JavaRDD<Person> rdds = ...; rdds.cache(); - this caching is not needed, as you are not reading the RDD for any action. 2. Instead of collecting as a list, if you can save as a text file, it

Re: Building Spark : Adding new DataType in Catalyst

2015-04-23 Thread zia_kayani
I've already tried UDTs in Spark 1.2 and 1.3, but I encountered a Kryo serialization exception on joining, as tracked here: https://datastax-oss.atlassian.net/browse/SPARKC-23 . I've talked to Michael Armbrust https://plus.google.com/u/1/109154927192908362223/posts about the exception; he said I'll

Re: problem with spark thrift server

2015-04-23 Thread Arush Kharbanda
Hi, What do you mean by disabling the driver? What are you trying to achieve? Thanks, Arush On Thu, Apr 23, 2015 at 12:29 PM, guoqing0...@yahoo.com.hk guoqing0...@yahoo.com.hk wrote: Hi, I have a question about the Spark thrift server: I deployed Spark on YARN and found that if the spark driver

Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-23 Thread Sourav Chandra
Hi TD, Some observations: 1. If I submit the application using the spark-submit tool with *client as deploy mode* it works fine with a single master and worker (driver, master and worker are running on the same machine) 2. If I submit the application using the spark-submit tool with client as deploy mode it

Re: Trouble working with Spark-CSV package (error: object databricks is not a member of package com)

2015-04-23 Thread Krishna Sankar
Do you have commons-csv-1.1-bin.jar in your path somewhere? I had to download and add this. Cheers k/ On Wed, Apr 22, 2015 at 11:01 AM, Mohammed Omer beancinemat...@gmail.com wrote: Afternoon all, I'm working with Scala 2.11.6, and Spark 1.3.1 built from source via: `mvn -Pyarn

Re: Error in creating spark RDD

2015-04-23 Thread madhvi
On Thursday 23 April 2015 12:22 PM, Akhil Das wrote: Here's a complete scala example https://github.com/bbux-proteus/spark-accumulo-examples/blob/1dace96a115f29c44325903195c8135edf828c86/src/main/scala/org/bbux/spark/AccumuloMetadataCount.scala Thanks Best Regards On Thu, Apr 23, 2015 at

Re: Re: HiveContext setConf seems not stable

2015-04-23 Thread guoqing0...@yahoo.com.hk
Hi all, My understanding of this problem is that SQLConf will be overwritten by the Hive config in the initialization phase, when setConf(key: String, value: String) is called for the first time, as in the below code snippets, so it is correct afterwards. I'm not sure whether this is right; any points are

Is a higher-res or vector version of Spark logo available?

2015-04-23 Thread Enno Shioji
My employer (adform.com) would like to use the Spark logo in a recruitment event (to indicate that we are using Spark in our company). I looked in the Spark repo (https://github.com/apache/spark/tree/master/docs/img) but couldn't find a vector format. Is a higher-res or vector format version

Streaming Kmeans usage in java

2015-04-23 Thread Jeetendra Gangele
Hello everyone, do we have a sample example of how to use streaming k-means clustering with Java? I have seen some example usage in Scala. Can anybody point me to a Java example? regards jeetendra

How to start Thrift JDBC server as part of standalone spark application?

2015-04-23 Thread Vladimir Grigor
Hello, I would like to export RDD/DataFrames via a JDBC SQL interface from a standalone application for the currently stable Spark v1.3.1. I found one way of doing it, but it requires the use of the @DeveloperApi method HiveThriftServer2.startWithContext(sqlContext). Is there a better, production-level
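For reference, the @DeveloperApi route mentioned above looks roughly like this on Spark 1.3.x (the DataFrame and table names are placeholders):

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val hiveContext = new HiveContext(sc)
    myDataFrame.registerTempTable("my_table")
    HiveThriftServer2.startWithContext(hiveContext) // exposes the context's tables over JDBC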

Re: How to debug Spark on Yarn?

2015-04-23 Thread Ted Yu
For step 2, you can pipe application log to a file instead of copy-pasting. Cheers On Apr 22, 2015, at 10:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I submit a spark app to YARN and i get these messages 15/04/22 22:45:04 INFO yarn.Client: Application report for

Re: MLlib - Collaborative Filtering - trainImplicit task size

2015-04-23 Thread Christian S. Perone
All these warnings come from ALS iterations, from flatMap and also from aggregate, for instance the origin of the state where the flatMap is showing these warnings (w/ Spark 1.3.0, they are also shown in Spark 1.3.1): org.apache.spark.rdd.RDD.flatMap(RDD.scala:296)

ML regression - spark context dies without error

2015-04-23 Thread jamborta
Hi all, I have been testing Spark ML algorithms with a bigger dataset and ran into some problems with linear regression: it seems the executors stop without any apparent reason: 15/04/22 20:15:05 INFO BlockManagerInfo: Added rdd_12492_80 in memory on backend-node:48037 (size: 28.5 MB, free: 2.8

Contributors, read me! Updated Contributing to Spark wiki

2015-04-23 Thread Sean Owen
Following several discussions about how to improve the contribution process in Spark, I've overhauled the guide to contributing. Anyone who is going to contribute needs to read it, as it has more formal guidance about the process:

Re: StandardScaler failing with OOM errors in PySpark

2015-04-23 Thread Rok Roskar
ok yes, I think I have narrowed it down to being a problem with driver memory settings. It looks like the application master/driver is not being launched with the settings specified: For the driver process on the main node I see -XX:MaxPermSize=128m -Xms512m -Xmx512m as options used to start the
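In yarn-client mode the driver JVM is already running before spark.driver.memory from the configuration can be applied, so the usual workaround (a sketch; 8g is illustrative) is to pass the memory at launch time:

    pyspark --master yarn-client --driver-memory 8g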

Re: gridsearch - python

2015-04-23 Thread Punyashloka Biswal
https://issues.apache.org/jira/browse/SPARK-7022. Punya On Thu, Apr 23, 2015 at 5:47 PM Pagliari, Roberto rpagli...@appcomsci.com wrote: Can anybody point me to an example, if available, about gridsearch with python? Thank you,

RE: Re: problem with spark thrift server

2015-04-23 Thread Cheng, Hao
Hi, can you describe a little bit how the ThriftServer crashed, or steps to reproduce that? It’s probably a bug of ThriftServer. Thanks, From: guoqing0...@yahoo.com.hk [mailto:guoqing0...@yahoo.com.hk] Sent: Friday, April 24, 2015 9:55 AM To: Arush Kharbanda Cc: user Subject: Re: Re: problem

Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-23 Thread Sourav Chandra
*bump* On Thu, Apr 23, 2015 at 3:46 PM, Sourav Chandra sourav.chan...@livestream.com wrote: HI TD, Some observations: 1. If I submit the application using spark-submit tool with *client as deploy mode* it works fine with single master and worker (driver, master and worker are running in

Re: Is the Spark-1.3.1 support build with scala 2.8 ?

2015-04-23 Thread madhu phatak
Hi, AFAIK it's only built with 2.10 and 2.11. You should integrate kafka_2.10.0-0.8.0 to make it work. Regards, Madhukara Phatak http://datamantra.io/ On Fri, Apr 24, 2015 at 9:22 AM, guoqing0...@yahoo.com.hk guoqing0...@yahoo.com.hk wrote: Is the Spark-1.3.1 support build with scala 2.8

RE: gridsearch - python

2015-04-23 Thread Pagliari, Roberto
I know grid search with cross-validation is not supported. However, I was wondering if there is something available for the time being. Thanks, From: Punyashloka Biswal [mailto:punya.bis...@gmail.com] Sent: Thursday, April 23, 2015 9:06 PM To: Pagliari, Roberto; user@spark.apache.org

Re: Re: Is the Spark-1.3.1 support build with scala 2.8 ?

2015-04-23 Thread guoqing0...@yahoo.com.hk
Thank you very much for your suggestion. Regards, From: madhu phatak Date: 2015-04-24 13:06 To: guoqing0...@yahoo.com.hk CC: user Subject: Re: Is the Spark-1.3.1 support build with scala 2.8 ? Hi, AFAIK it's only build with 2.10 and 2.11. You should integrate kafka_2.10.0-0.8.0 to make it

Spark SQL - Setting YARN Classpath for primordial class loader

2015-04-23 Thread Night Wolf
Hi guys, Having a problem building a DataFrame in Spark SQL from a JDBC data source when running with --master yarn-client and adding the JDBC driver JAR with --jars. If I run with a local[*] master all works fine. ./bin/spark-shell --jars /tmp/libs/mysql-jdbc.jar --master yarn-client

Re: Spark SQL - Setting YARN Classpath for primordial class loader

2015-04-23 Thread Marcelo Vanzin
You'd have to use spark.{driver,executor}.extraClassPath to modify the system class loader. But that also means you have to manually distribute the jar to the nodes in your cluster, into a common location. On Thu, Apr 23, 2015 at 6:38 PM, Night Wolf nightwolf...@gmail.com wrote: Hi guys,
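A sketch of that, assuming mysql-jdbc.jar has been copied to the same local path on every node (the /opt/libs path is illustrative):

    spark-shell --master yarn-client \
      --conf spark.driver.extraClassPath=/opt/libs/mysql-jdbc.jar \
      --conf spark.executor.extraClassPath=/opt/libs/mysql-jdbc.jar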

Re: Spark SQL - Setting YARN Classpath for primordial class loader

2015-04-23 Thread Night Wolf
Thanks Marcelo, can this be a path on HDFS? On Fri, Apr 24, 2015 at 11:52 AM, Marcelo Vanzin van...@cloudera.com wrote: You'd have to use spark.{driver,executor}.extraClassPath to modify the system class loader. But that also means you have to manually distribute the jar to the nodes in your

Re: Spark SQL - Setting YARN Classpath for primordial class loader

2015-04-23 Thread Marcelo Vanzin
No, those have to be local paths. On Thu, Apr 23, 2015 at 6:53 PM, Night Wolf nightwolf...@gmail.com wrote: Thanks Marcelo, can this be a path on HDFS? On Fri, Apr 24, 2015 at 11:52 AM, Marcelo Vanzin van...@cloudera.com wrote: You'd have to use spark.{driver,executor}.extraClassPath to

Re: Re: problem with spark thrift server

2015-04-23 Thread guoqing0...@yahoo.com.hk
Thanks for your reply. I would like to use the Spark Thriftserver as a JDBC SQL interface, with the Spark application running on YARN, but the application was FINISHED when the Thriftserver crashed, and all the cached tables were lost. Thriftserver start command: start-thriftserver.sh --master yarn

spark 1.3.0 strange log message

2015-04-23 Thread Henry Hung
Dear All, When using spark 1.3.0 spark-submit and directing stdout and stderr to a log file, I saw some strange lines inside that look like this: [Stage 0:(0 + 2) / 120] [Stage 0:(2 + 2) /

Re: spark 1.3.0 strange log message

2015-04-23 Thread Terry Hole
Use this in spark conf: spark.ui.showConsoleProgress=false Best Regards, On Fri, Apr 24, 2015 at 11:23 AM, Henry Hung ythu...@winbond.com wrote: Dear All, When using spark 1.3.0 spark-submit with directing out and err to a log file, I saw some strange lines inside that looks like this:
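The same flag can also be passed at submit time, e.g. (the jar name is a placeholder):

    spark-submit --conf spark.ui.showConsoleProgress=false yourapp.jar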