Master dies and worker registration fails with duplicated worker id

2015-10-19 Thread ZhuGe
Hi all: We met a series of weird problems in our standalone cluster with 2 masters (ZK election agent). Q1: First, we find the active master would lose leadership at some point and shut itself down. [INFO 2015-10-17 13:00:15 (ClientCnxn.java:1083)] Client session timed out, have not heard from

Re: Spark handling parallel requests

2015-10-19 Thread Akhil Das
Yes, there are Kafka consumers/producers for almost all languages; you can read more here: https://cwiki.apache.org/confluence/display/KAFKA/Clients#Clients-PHP Here's a repo for the PHP version: https://github.com/EVODelavega/phpkafka Thanks Best Regards On Sun, Oct 18, 2015 at 12:58 PM,

Re: In-memory computing and cache() in Spark

2015-10-19 Thread Igor Berman
Do your iterations really submit a job? I don't see any action there. On Oct 17, 2015 00:03, "Jia Zhan" wrote: > Hi all, > > I am running Spark locally in one node and trying to sweep the memory size > for performance tuning. The machine has 8 CPUs and 16G main memory, the

best way to generate per key auto increment numerals after sorting

2015-10-19 Thread fahad shah
Hi, I wanted to ask what's the best way to achieve per-key auto-increment numerals after sorting, e.g.: raw file: 1,a,b,c,1,1 1,a,b,d,0,0 1,a,b,e,1,0 2,a,e,c,0,0 2,a,f,d,1,0 post-output (the last column is the position number after grouping on the first three fields and reverse sorting on the last
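One possible shape of an answer, sketched here in Scala for illustration (the thread itself is PySpark; the field splitting, the choice of sort column, and the in-memory per-group sort are assumptions, not from the thread):

    // Per-key position numbers after reverse sorting; assumes lines like "f1,f2,f3,f4,f5,f6"
    // and groups small enough to sort in memory.
    val lines = sc.textFile("raw.csv")
    val numbered = lines
      .map(_.split(","))
      .map(f => ((f(0), f(1), f(2)), f))        // key = first three fields
      .groupByKey()
      .flatMap { case (key, rows) =>
        rows.toSeq
          .sortBy(r => -r(5).toInt)             // reverse sort on the last column
          .zipWithIndex
          .map { case (r, idx) => r.mkString(",") + "," + (idx + 1) }
      }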

Re: pyspark groupbykey throwing error: unpack requires a string argument of length 4

2015-10-19 Thread Jeff Zhang
Stacktrace would be helpful if you can provide that. On Mon, Oct 19, 2015 at 1:42 PM, fahad shah wrote: > Hi > > I am trying to do pair rdd's, group by the key assign id based on key. > I am using Pyspark with spark 1.3, for some reason, I am getting this > error that I

Re: pyspark groupbykey throwing error: unpack requires a string argument of length 4

2015-10-19 Thread fahad shah
Thanks Jeff, please find the stack trace below:

Py4JJavaError Traceback (most recent call last)
 in ()
      1 BB = B.groupByKey()
>     2 BB.take(1)

C:\apps\dist\spark-1.3.1.2.2.7.1-0004\python\pyspark\rdd.py in take(self, num)
   1222
   1223 p =

spark streaming failing to replicate blocks

2015-10-19 Thread Eugen Cepoi
Hi, I am running Spark Streaming 1.4.1 on EMR (AMI 3.9) over YARN. The job is reading data from Kinesis and the batch interval is 30s (I used the same value for the Kinesis checkpointing). In the executor logs I can see every 5 seconds a sequence of stacktraces indicating that the block

Spark executor on Mesos - how to set effective user id?

2015-10-19 Thread Eugene Chepurniy
Hi everyone! While we are trying to utilize a Spark on Mesos cluster, we are facing an issue related to the effective Linux user id used to start executors on Mesos slaves: all executors are trying to use the driver's Linux user id to start on the Mesos slaves. Let me explain in detail: the spark driver

Is one batch created by Streaming Context always equal to one RDD?

2015-10-19 Thread vaibhavrtk

How to make user jars take precedence over Spark jars

2015-10-19 Thread YiZhi Liu
I'm trying to read a Thrift object from SequenceFile, using elephant-bird's ThriftWritable. My code looks like:

val rawData = sc.sequenceFile[BooleanWritable, ThriftWritable[TrainingSample]](input)
val samples = rawData.map { case (key, value) => {
  value.setConverter(classOf[TrainingSample])

RE: Spark Streaming - use the data in different jobs

2015-10-19 Thread Ewan Leith
Storing the data in HBase, Cassandra, or similar is possibly the right answer; the other option that can work well is re-publishing the data back into a second queue on RabbitMQ, to be read again by the next job. Thanks, Ewan From: Oded Maimon [mailto:o...@scene53.com] Sent: 18 October 2015

Re: Incrementally add/remove vertices in GraphX

2015-10-19 Thread mas
Dear All, Any update regarding graph streaming? I want to update, i.e., add vertices and edges after the creation of the graph. Any suggestions or recommendations on how to do that? Thanks,

RE: Should I convert json into parquet?

2015-10-19 Thread Ewan Leith
As Jörn says, Parquet and ORC will get you really good compression and can be much faster. There are also some nice additions around predicate pushdown, which can be great if you've got wide tables. Parquet is obviously easier to use, since it's bundled into Spark. Using ORC is described here
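For reference, a minimal sketch of writing a DataFrame in either format (output paths are placeholders; ORC through the data source API generally needs a Hive-enabled Spark build):

    // Parquet support is bundled with Spark; ORC goes through the Hive data source
    df.write.parquet("/tmp/out-parquet")
    df.write.format("orc").save("/tmp/out-orc")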

Re: Should I convert json into parquet?

2015-10-19 Thread Adrian Tanase
For general data access of the pre-computed aggregates (group by) you’re better off with Parquet. I’d only choose JSON if I needed interop with another app stack / language that has difficulty accessing Parquet (e.g. bulk load into a document DB…). On a strategic level, both JSON and Parquet are

Re: Spark executor on Mesos - how to set effective user id?

2015-10-19 Thread Jerry Lam
Can you try setting SPARK_USER at the driver? It is used to impersonate users at the executor. So if you have a user set up for launching Spark jobs on the executor machines, simply set SPARK_USER to that user name. There is another configuration that prevents jobs from being launched with

Re: Spark Streaming - use the data in different jobs

2015-10-19 Thread Adrian Tanase
+1 for re-publishing to pubsub if there is only transient value in the data. If you need to query the intermediate representation then you will need to use a database. Sharing RDDs in memory should be possible with projects like spark job server but I think that’s overkill in this scenario.

Re: Spark executor on Mesos - how to set effective user id?

2015-10-19 Thread SLiZn Liu
Hi Jerry, I think you are referring to --no-switch_user. =) chiling...@gmail.com> wrote on Mon, Oct 19, 2015 at 21:05: > Can you try setting SPARK_USER at the driver? It is used to impersonate > users at the executor. So if you have user setup for launching spark jobs > on the executor machines, simply

Re: How does shuffle work in spark ?

2015-10-19 Thread shahid
@all I did partitionBy using the default hash partitioner on data [(1,data), (2,data), ..., (n,data)]. The total data was approx 3.5 GB; it showed a shuffle write of 50G and on the next action, e.g. count, it is showing a shuffle read of 50G. I don't understand this behaviour and I think the performance is getting slow with

Re: How to make user jars take precedence over Spark jars

2015-10-19 Thread YiZhi Liu
Hi Ted, Unfortunately these two options cause the following failure in my environment: (java.lang.RuntimeException: class org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not org.apache.hadoop.security.GroupMappingServiceProvider, java.lang.RuntimeException:

Re: How to make user jars take precedence over Spark jars

2015-10-19 Thread Ted Yu
Have you tried the following options? --conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true Cheers On Mon, Oct 19, 2015 at 5:07 AM, YiZhi Liu wrote: > I'm trying to read a Thrift object from SequenceFile, using > elephant-bird's

Re: Spark SQL Thriftserver and Hive UDF in Production

2015-10-19 Thread Deenar Toraskar
Reece You can do the following. Start the spark-shell. Register the UDFs in the shell using sqlContext, then start the Thrift Server using startWithContext from the spark shell: https://github.com/apache/spark/blob/master/sql/hive-
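For illustration, a rough sketch of that flow (the UDF name and body are made up; it assumes the hive-thriftserver module is on the classpath and that the shell's context is, or can be replaced by, a HiveContext):

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val hiveContext = new HiveContext(sc)   // or reuse the shell's sqlContext if it is a HiveContext

    // Register an example UDF on the context
    hiveContext.udf.register("toUpper", (s: String) => s.toUpperCase)

    // Expose the same context over JDBC/ODBC so the registered UDFs are visible to clients
    HiveThriftServer2.startWithContext(hiveContext)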

[Spark MLlib] How to apply spark ml given models for questions with general background

2015-10-19 Thread Zhiliang Zhu
Dear All, I am new to Spark ML. I have a project with a given math model, and I would like to get its optimized solution. It is very similar to a Spark MLlib application. However, the key problem for me is that the given math model does not obviously belong to the models (as

Re: How to make user jars take precedence over Spark jars

2015-10-19 Thread Ted Yu
Can you use: https://maven.apache.org/plugins/maven-shade-plugin/ to shade the dependencies unique to your project ? On Mon, Oct 19, 2015 at 7:47 AM, YiZhi Liu wrote: > Hi Ted, > > Unfortunately these two options cause following failure in my environment: > >

flattening a JSON data structure

2015-10-19 Thread nunomrc
Hi, I am fairly new to Spark and I am trying to flatten the following structure:

|-- provider: struct (nullable = true)
|    |-- accountId: string (nullable = true)
|    |-- contract: array (nullable = true)

And then provider is: root

k-prototypes in MLLib?

2015-10-19 Thread Fernando Velasco
Hi everyone! I am a data scientist new to Spark and I am interested in clustering of mixed variables. I am more used to R, where there are implementations like daisy, PAM, etc. It is true that dummy variables along with K-Means can perform a nice job on clustering mixed variables, but I find this

Re: Spark SQL Thriftserver and Hive UDF in Production

2015-10-19 Thread Todd Nist
From Tableau, you should be able to use the Initial SQL option to support this. So in Tableau add the following to the “Initial SQL”: create function myfunc AS 'myclass' using jar 'hdfs:///path/to/jar'; HTH, Todd On Mon, Oct 19, 2015 at 11:22 AM, Deenar Toraskar

Re: Spark handling parallel requests

2015-10-19 Thread Adrian Tanase
To answer your specific question, you can’t push data to Kafka through a socket – you need a smart client library, as the cluster setup is pretty advanced (it also requires ZooKeeper). I bet there are PHP libraries for Kafka, although after a quick search it seems they’re still pretty young. Also –

Re: best way to generate per key auto increment numerals after sorting

2015-10-19 Thread fahad shah
Thanks Davies, groupByKey was throwing the error: unpack requires a string argument of length 4. Interestingly, I replaced that with sortByKey (which I read also shuffles so that data for the same key are on the same partition) and it ran fine - wondering if this is a bug in groupByKey for Spark 1.3?

Spark SQL Exception: Conf non-local session path expected to be non-null

2015-10-19 Thread YaoPau
I've connected Spark SQL to the Hive Metastore and currently I'm running SQL code via pyspark. Typically everything works fine, but sometimes after a long-running Spark SQL job I get the error below, and from then on I can no longer run Spark SQL commands. I still do have both my sc and my

Dynamic Allocation & Spark Streaming

2015-10-19 Thread robert towne
I have watched a few videos from Databricks/Andrew Or around the Spark 1.2 release and it seemed that dynamic allocation was not yet available for Spark Streaming. I now see SPARK-10955 which is tied to 1.5.2 and allows disabling of Spark

Re: writing avro parquet

2015-10-19 Thread Alex Nastetsky
Figured it out ... needed to use saveAsNewAPIHadoopFile, but was trying to use it on myDF.rdd instead of converting it to a PairRDD first. On Mon, Oct 19, 2015 at 2:14 PM, Alex Nastetsky < alex.nastet...@vervemobile.com> wrote: > Using Spark 1.5.1, Parquet 1.7.0. > > I'm trying to write
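For anyone hitting the same thing, a hedged sketch of what that shape can look like with parquet-avro 1.7.0 (MyClass, myDF and outputPath are names from the original post; rowToAvro is a hypothetical Row-to-Avro converter you would supply):

    import org.apache.hadoop.mapreduce.Job
    import org.apache.parquet.avro.AvroParquetOutputFormat

    val job = Job.getInstance(sc.hadoopConfiguration)
    AvroParquetOutputFormat.setSchema(job, MyClass.SCHEMA$)

    // saveAsNewAPIHadoopFile is defined on pair RDDs, hence the (null, record) mapping
    myDF.rdd
      .map(row => (null: Void, rowToAvro(row)))   // rowToAvro: hypothetical Row -> MyClass converter
      .saveAsNewAPIHadoopFile(
        outputPath,
        classOf[Void],
        classOf[MyClass],
        classOf[AvroParquetOutputFormat],
        job.getConfiguration)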

Spark ML/MLib newbie question

2015-10-19 Thread George Paulson
I have a dataset that's relatively big, but easily fits in memory. I want to generate many different features for this dataset and then run L1 regularized Logistic Regression on the feature enhanced dataset. The combined features will easily exhaust memory. I was hoping there was a way that I

Re: PySpark + Streaming + DataFrames

2015-10-19 Thread Tathagata Das
RDD and DF are not compatible data types, so you cannot return a DF when you have to return an RDD. What you can do instead is return the underlying RDD of the DataFrame via dataframe.rdd. On Fri, Oct 16, 2015 at 12:07 PM, Jason White wrote: > Hi Ken, thanks for
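In code terms (Scala shown; in PySpark it is the dataframe.rdd attribute; the query is a placeholder):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row

    // Inside a DStream transform you must hand back an RDD, so return the DataFrame's underlying RDD[Row]
    val resultDF = sqlContext.sql("SELECT ...")   // placeholder query
    val resultRDD: RDD[Row] = resultDF.rdd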

Concurrency/Multiple Users

2015-10-19 Thread GooniesNeverSayDie
I am trying to connect to an Apache Spark 1.4 server with multiple users. Here is the issue in short form: Connection 1 specifies database test1 at connection time. Show tables shows test1 database tables. Connection 2 specifies database test2 at connection time. Show tables shows test2 database

Re: new 1.5.1 behavior - exception on executor throws ClassNotFound on driver

2015-10-19 Thread Ted Yu
Interesting. I wonder why existing tests didn't catch that:

class UnserializableException extends Exception {
./core/src/test/scala/org/apache/spark/rpc/RpcEnvSuite.scala

class DAGSchedulerSuiteDummyException extends Exception

Re: Concurrency/Multiple Users

2015-10-19 Thread Michael Armbrust
Unfortunately the implementation of SPARK-2087 didn't have enough tests and got broken in 1.4. In Spark 1.6 we will have a much more solid fix: https://github.com/apache/spark/commit/3390b400d04e40f767d8a51f1078fcccb4e64abd On Mon, Oct 19, 2015 at 2:13 PM, GooniesNeverSayDie

Re: How to put an object in cache for ever in Streaming

2015-10-19 Thread Tathagata Das
That should also get cleaned through GC, though you may have to explicitly run GC periodically for faster clean-up. RDDs are by definition distributed across executors in parts. When cached, the RDD partitions are kept in memory across the executors. On Fri, Oct 16, 2015 at 6:15 PM, swetha

Re: Issue in spark batches

2015-10-19 Thread Tathagata Das
If Cassandra is down, does saveToCassandra throw an exception? If it does, you can catch that exception and write your own logic to retry and/or skip the update. Once the foreachRDD function completes, that batch will be internally marked as completed. TD On Mon, Oct 19, 2015 at 5:48 AM, varun sharma
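A rough sketch of that guard, assuming the spark-cassandra-connector's saveToCassandra, a DStream whose elements map onto the table's columns, and illustrative keyspace/table names:

    import com.datastax.spark.connector._

    stream.foreachRDD { rdd =>
      try {
        rdd.saveToCassandra("my_keyspace", "my_table")   // illustrative names
      } catch {
        case e: Exception =>
          // Cassandra unreachable: apply your own retry / skip policy before the batch is marked complete
          println(s"saveToCassandra failed for this batch: ${e.getMessage}")
      }
    }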

Re: Dynamic Allocation & Spark Streaming

2015-10-19 Thread Tathagata Das
Unfortunately the title on the JIRA is extremely confusing. I have fixed it. The reason why dynamic allocation does not work well with streaming is that the heuristic that is used to automatically scale up or down the number of executors works for the pattern of task schedules in batch jobs, not

Re: flattening a JSON data structure

2015-10-19 Thread Michael Armbrust
A quick fix is probably to use Seq[Row] instead of Array (the types that are returned are documented here: http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types). Really though, you probably want to be using explode. Perhaps something like this would help? import
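Michael's own snippet is cut off above; as a hedged sketch of the explode approach he describes (column names are taken from the question, the alias is illustrative):

    import org.apache.spark.sql.functions._

    // One output row per element of provider.contract, keeping accountId alongside it
    val flattened = df.select(
      col("provider.accountId"),
      explode(col("provider.contract")).as("contract"))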

Re: new 1.5.1 behavior - exception on executor throws ClassNotFound on driver

2015-10-19 Thread Lij Tapel
Wow thanks, that's a great place to start digging deeper. Would it be appropriate to file this on JIRA? It makes spark 1.5.1 a bit of a deal breaker for me but I wouldn't mind taking a shot at fixing it given some guidance On Mon, Oct 19, 2015 at 1:03 PM, Ted Yu wrote: >

Re: Is one batch created by Streaming Context always equal to one RDD?

2015-10-19 Thread Tathagata Das
Each DStream creates one RDD per batch. On Mon, Oct 19, 2015 at 4:39 AM, vaibhavrtk wrote: > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Is-one-batch-created-by-Streaming-Context-always-equal-to-one-RDD-tp25117.html >

Re: Spark 1.5 Streaming and Kinesis

2015-10-19 Thread Phil Kallos
I am currently trying a few code changes to see if I can squash this error. I have created https://issues.apache.org/jira/browse/SPARK-11193 to track progress, hope that is okay! In the meantime, can anyone confirm their ability to run the Kinesis-ASL example using Spark > 1.5.x ? Would be

Re: pyspark groupbykey throwing error: unpack requires a string argument of length 4

2015-10-19 Thread fahad shah
Thanks Davies, sure, I can share the code/data in pm - best fahad On Mon, Oct 19, 2015 at 10:52 AM, Davies Liu wrote: > Could you simplify the code a little bit so we can reproduce the failure? > (may also have some sample dataset if it depends on them) > > On Sun, Oct 18,

Re: How does shuffle work in spark ?

2015-10-19 Thread Adrian Tanase
I don’t know why it expands to 50 GB but it’s correct to see it both on the first operation (shuffled write) and on the next one (shuffled read). It’s the barrier between the 2 stages. -adrian From: shahid ashraf Date: Monday, October 19, 2015 at 9:53 PM To: Kartik Mathur, Adrian Tanase Cc:

Re: Spark Streaming scheduler delay VS driver.cores

2015-10-19 Thread Adrian Tanase
Bump on this question – does anyone know what is the effect of spark.driver.cores on the driver's ability to manage larger clusters? Any tips on setting a correct value? I’m running Spark streaming on Yarn / Hadoop 2.6 / Spark 1.5.1. Thanks, -adrian From: Adrian Tanase Date: Saturday, October

Re: Streaming of COAP Resources

2015-10-19 Thread Adrian Tanase
I’m not familiar with your COAP library, but onStart is called only once. You’re only reading the value once, when the custom receiver is initialized. You need to set up a callback or poll a buffer (again, this depends on your COAP client). In short, configure your client to “start listening for changes”
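As a generic illustration of that pattern (not COAP-specific; pollFn stands in for whatever read or callback the client library exposes):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class PollingReceiver(pollFn: () => Option[String], intervalMs: Long)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        // onStart runs once, so start a thread that keeps reading and pushing values with store()
        new Thread("poller") {
          override def run(): Unit = {
            while (!isStopped()) {
              pollFn().foreach(v => store(v))
              Thread.sleep(intervalMs)
            }
          }
        }.start()
      }

      def onStop(): Unit = { /* the polling thread exits once isStopped() returns true */ }
    }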

Re: In-memory computing and cache() in Spark

2015-10-19 Thread Jia Zhan
Hi Igor, It iteratively conducts reduce((a,b) => a+b), which is the action there. I can clearly see 4 stages (one saveAsTextFile() and three reduce()) in the web UI. Don't know what's going on there that causes the non-intuitive caching behavior. Thanks for help! On Sun, Oct 18, 2015 at 11:32 PM,

Re: best way to generate per key auto increment numerals after sorting

2015-10-19 Thread Davies Liu
What's the issue with groupByKey()? On Mon, Oct 19, 2015 at 1:11 AM, fahad shah wrote: > Hi > > I wanted to ask whats the best way to achieve per key auto increment > numerals after sorting, for eg. : > > raw file: > > 1,a,b,c,1,1 > 1,a,b,d,0,0 > 1,a,b,e,1,0 > 2,a,e,c,0,0 >

Re: pyspark groupbykey throwing error: unpack requires a string argument of length 4

2015-10-19 Thread Davies Liu
Could you simplify the code a little bit so we can reproduce the failure? (may also have some sample dataset if it depends on them) On Sun, Oct 18, 2015 at 10:42 PM, fahad shah wrote: > Hi > > I am trying to do pair rdd's, group by the key assign id based on key. > I am

Storing Compressed data in HDFS into Spark

2015-10-19 Thread ahaider3
Hi, A lot of the data I have in HDFS is compressed. I noticed that when I load this data into Spark and cache it, Spark unrolls the data like normal but stores it uncompressed in memory. For example, suppose data is an RDD with compressed partitions on HDFS. I then cache the data. When I call

Re: new 1.5.1 behavior - exception on executor throws ClassNotFound on driver

2015-10-19 Thread Lij Tapel
I have verified that there is only 5.1.34 on the classpath. Funnily enough, I have a repro that doesn't even use mysql so this seems to be purely a classloader issue: source: http://pastebin.com/WMCMwM6T 1.4.1: http://pastebin.com/x38DQY2p 1.5.1: http://pastebin.com/DQd6k818 On Mon, Oct 19,

Re: pyspark: results differ based on whether persist() has been called

2015-10-19 Thread Davies Liu
This should be fixed by https://github.com/apache/spark/commit/a367840834b97cd6a9ecda568bb21ee6dc35fcde Will be released as 1.5.2 soon. On Mon, Oct 19, 2015 at 9:04 AM, peay2 wrote: > Hi, > > I am getting some very strange results, where I get different results based > on

Re: new 1.5.1 behavior - exception on executor throws ClassNotFound on driver

2015-10-19 Thread Ted Yu
The attachments didn't go through. Consider using pastebin. Thanks On Mon, Oct 19, 2015 at 11:15 AM, gbop wrote: > I've been struggling with a particularly puzzling issue after upgrading to > Spark 1.5.1 from Spark 1.4.1. > > When I use the MySQL JDBC connector and an

Re: new 1.5.1 behavior - exception on executor throws ClassNotFound on driver

2015-10-19 Thread Ted Yu
Lij:

jar tvf /Users/tyu/.m2/repository//mysql/mysql-connector-java/5.1.31/mysql-connector-java-5.1.31.jar | grep MySQLSyntaxErrorExceptio
  914 Wed May 21 01:42:16 PDT 2014 com/mysql/jdbc/exceptions/MySQLSyntaxErrorException.class
  842 Wed May 21 01:42:18 PDT 2014

Re: Differentiate Spark streaming in event logs

2015-10-19 Thread Adrian Tanase
You could try to start the 2/N jobs with a slightly different log4j template, by prepending some job type to all the messages... On 10/19/15, 9:47 PM, "franklyn" wrote: >Hi I'm running a job to collect some analytics on spark jobs by analyzing >their event logs. We

Re: How to calculate row by row and output results in Spark

2015-10-19 Thread Ted Yu
Under core/src/test/scala/org/apache/spark, you will find a lot of examples of the map function. FYI On Mon, Oct 19, 2015 at 10:35 AM, Shepherd wrote: > Hi all, I am new in Spark and Scala. I have a question in doing > calculation. I am using "groupBy" to generate key value

Differentiate Spark streaming in event logs

2015-10-19 Thread franklyn
Hi I'm running a job to collect some analytics on spark jobs by analyzing their event logs. We write the event logs to a single HDFS folder and then pick them up in another job. I'd like to differentiate between regular spark jobs and spark streaming jobs in the event logs, i was wondering if

How to calculate row by row and output results in Spark

2015-10-19 Thread Shepherd
Hi all, I am new to Spark and Scala. I have a question about doing a calculation. I am using "groupBy" to generate key-value pairs, where the value points to a subset of the original RDD. The RDD has four columns, and each subset RDD may have a different number of rows. For example, the original code looks like

writing avro parquet

2015-10-19 Thread Alex Nastetsky
Using Spark 1.5.1, Parquet 1.7.0. I'm trying to write Avro/Parquet files. I have this code:

sc.hadoopConfiguration.set(ParquetOutputFormat.WRITE_SUPPORT_CLASS, classOf[AvroWriteSupport].getName)
AvroWriteSupport.setSchema(sc.hadoopConfiguration, MyClass.SCHEMA$)
myDF.write.parquet(outputPath)

new 1.5.1 behavior - exception on executor throws ClassNotFound on driver

2015-10-19 Thread gbop
I've been struggling with a particularly puzzling issue after upgrading to Spark 1.5.1 from Spark 1.4.1. When I use the MySQL JDBC connector and an exception (e.g. com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException) is thrown on the executor, I get a ClassNotFoundException on the driver,

Re: new 1.5.1 behavior - exception on executor throws ClassNotFound on driver

2015-10-19 Thread Lij Tapel
Sorry, here's the logs and source: The error I see in spark 1.5.1: http://pastebin.com/86K9WQ5f * full logs here: http://pastebin.com/dfysSh9E What I used to see in spark 1.4.1: http://pastebin.com/eK3AZQFx * full logs here: http://pastebin.com/iffSFFWW The source and build.sbt:

Re: How to calculate row by row and output results in Spark

2015-10-19 Thread Adrian Tanase
Are you by any chance looking for reduceByKey? If you’re trying to collapse all the values in V into an aggregate, that’s what you should be looking at. -adrian From: Ted Yu Date: Monday, October 19, 2015 at 9:16 PM To: Shepherd Cc: user Subject: Re: How to calculate row by row and output
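For illustration, a minimal sketch of that, assuming the rows are tuples with the key first and one numeric column to aggregate (the positions are made up):

    // Aggregate one numeric column per key instead of keeping whole sub-RDDs around
    val sums = rdd
      .map { case (key, c2, c3, value) => (key, value) }   // illustrative positions
      .reduceByKey(_ + _)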

Re: Spark handling parallel requests

2015-10-19 Thread tarek.abouzeid91
Thanks guys for your advice, I will have a look at the custom receivers. Thanks again for your efforts. -- Best Regards, Tarek Abouzeid On Monday, October 19, 2015 6:50 PM, Adrian Tanase wrote: To answer your specific question, you can’t push data to

Re: PySpark + Streaming + DataFrames

2015-10-19 Thread Jason White
Ah, that makes sense then, thanks TD. The conversion from RDD -> DF involves a `.take(10)` in PySpark, even if you provide the schema, so I was avoiding back-and-forth conversions. I’ll see if I can create a ‘trusted’ conversion that doesn’t involve the `take`. --  Jason On October 19, 2015

java TwitterUtils.createStream(): how to create a "user stream"?

2015-10-19 Thread Andy Davidson
Hi, I wrote a little prototype that created a “public stream”; now I want to convert it to read tweets for a large number of explicit users, i.e. to create a “user stream” or a “site stream”. According to the Twitter developer doc I should be able to set the “follows” parameter to a list of users I

Re: PySpark + Streaming + DataFrames

2015-10-19 Thread Tathagata Das
Yes, precisely! Also, for other folks who may read this, could reply back with the trusted conversion that worked for you (for a clear solution)? TD On Mon, Oct 19, 2015 at 3:08 PM, Jason White wrote: > Ah, that makes sense then, thanks TD. > > The conversion from RDD

Succinct experience

2015-10-19 Thread Younes Naguib
Hi all, Anyone has any experience with SuccinctRDD? Thanks, Younes

serialization error

2015-10-19 Thread daze5112
Hi, having some problems with a piece of code I inherited. The error messages I get are: the code runs if I exclude the following line: Any help appreciated.

Re: serialization error

2015-10-19 Thread Ted Yu
Attachments didn't go through. Mind using pastebin to show the code / error ? Thanks On Mon, Oct 19, 2015 at 3:01 PM, daze5112 wrote: > Hi having some problems with the piece of code I inherited: > > > > > the error messages i get are: > > > the code runs if i

Re: Spark SQL Exception: Conf non-local session path expected to be non-null

2015-10-19 Thread Ted Yu
A brief search led me to ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java:

  private static final String HDFS_SESSION_PATH_KEY = "_hive.hdfs.session.path";
...
  public static Path getHDFSSessionPath(Configuration conf) {
    SessionState ss = SessionState.get();
    if (ss ==

Multiple Spark Streaming Jobs on Single Master

2015-10-19 Thread Augustus Hong
Hi All, Would it be possible to run multiple spark streaming jobs on a single master at the same time? I currently have one master node and several worker nodes in the standalone mode, and I used spark-submit to submit multiple spark streaming jobs. From what I observed, it seems like only the

Filter RDD

2015-10-19 Thread Shepherd
Hi all, I have a very simple question. I have an RDD, say r1, which contains 5 columns, with both String and Int. How can I get a sub-RDD, based on a rule that the second column equals a given string (s)? Thanks a lot.

Re: serialization error

2015-10-19 Thread Andy Huang
That particular line used an object which did not implement "Serializable" On Tue, Oct 20, 2015 at 9:01 AM, daze5112 wrote: > Hi having some problems with the piece of code I inherited: > > > > > the error messages i get are: > > > the code runs if i exclude the

Re: Multiple Spark Streaming Jobs on Single Master

2015-10-19 Thread Tathagata Das
You can set the max cores for the first submitted job such that it does not take all the resources from the master. See http://spark.apache.org/docs/latest/submitting-applications.html

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class
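The same cap can also be set in code for a standalone cluster (a minimal sketch; the app name and the value are illustrative):

    import org.apache.spark.SparkConf

    // Limit this application's cores so the next streaming job can still get executors on the same master
    val conf = new SparkConf()
      .setAppName("streaming-job-1")
      .set("spark.cores.max", "4")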

Re: Filter RDD

2015-10-19 Thread Ted Yu
See the filter() method: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L334 Cheers On Mon, Oct 19, 2015 at 4:27 PM, Shepherd wrote: > Hi all, > I have a very simple question. > I have a RDD, saying r1, which contains 5
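Applied to the question above, a minimal sketch assuming r1 is an RDD of tuples (adjust the field position and comparison to your actual row type):

    // Keep only the rows whose second column equals s
    val s = "some-value"
    val sub = r1.filter(_._2 == s)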

Re: In-memory computing and cache() in Spark

2015-10-19 Thread Jia Zhan
Hi Sonal, I tried changing spark.executor.memory but nothing changes. It seems that when I run locally on one machine, the RDD is cached in driver memory instead of executor memory. Here is a related post online:

Why must 'user' and 'product' be Int in MLlib ALS's Rating(user: Int, product: Int, rating: Double)?

2015-10-19 Thread futureage
Hi, I am learning MLlib's ALS. When I saw the case class Rating(user: Int, product: Int, rating: Double), I wanted to know why 'user' and 'product' are Int; they are just ids, so is Long or String not OK? Thanks