Re: Multiple Sources found for csv

2017-09-12 Thread jeff saremi
"com.databricks... ____ From: jeff saremi <jeffsar...@hotmail.com> Sent: Tuesday, September 12, 2017 3:38:00 PM To: user@spark.apache.org Subject: Multiple Sources found for csv I have this line which works in the Spark interactive console but it fails in IntelliJ Using Sp

Multiple Sources found for csv

2017-09-12 Thread jeff saremi
I have this line which works in the Spark interactive console but fails in IntelliJ, using Spark 2.1.1 in both cases: Exception in thread "main" java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
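The usual cause is that both the built-in Spark 2.x CSV source and the external com.databricks:spark-csv artifact end up on the IntelliJ project's classpath. A minimal sketch of one way around it, assuming that conflict; the input path is hypothetical:

```
// Minimal sketch, assuming the classpath contains both the built-in CSV source and
// com.databricks:spark-csv. Naming the provider by its fully qualified class skips
// the short-name lookup that raises "Multiple sources found for csv".
// The input path is hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-disambiguation").getOrCreate()

val df = spark.read
  .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
  .option("header", "true")
  .load("/path/to/input.csv")
```

Removing the com.databricks:spark-csv dependency from the IntelliJ build is usually the cleaner fix, since Spark 2.1 ships its own CSV reader.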

Re: Continue reading dataframe from file despite errors

2017-09-12 Thread jeff saremi
Thanks Suresh, it worked nicely. From: Suresh Thalamati <suresh.thalam...@gmail.com> Sent: Tuesday, September 12, 2017 2:59:29 PM To: jeff saremi Cc: user@spark.apache.org Subject: Re: Continue reading dataframe from file despite errors Try the CSV Option

Re: Continue reading dataframe from file despite errors

2017-09-12 Thread jeff saremi
.scala:250) ____ From: jeff saremi <jeffsar...@hotmail.com> Sent: Tuesday, September 12, 2017 2:32:03 PM To: user@spark.apache.org Subject: Continue reading dataframe from file despite errors I'm using a statement like the following to load my dataframe from some text file Upon encoun

Continue reading dataframe from file despite errors

2017-09-12 Thread jeff saremi
I'm using a statement like the following to load my dataframe from some text file. Upon encountering the first error, the whole thing throws an exception and processing stops. I'd like to continue loading even if that results in zero rows in my dataframe. How can I do that? Thanks
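The "CSV Option" suggested in the reply earlier in this thread is the parser mode. A minimal sketch, assuming the text file is CSV-like and that silently skipping bad rows is acceptable; the path and header option are hypothetical:

```
// Minimal sketch, assuming a CSV-like input. "DROPMALFORMED" skips rows that fail to
// parse instead of aborting the load; "PERMISSIVE" keeps them (nulling bad fields) and
// "FAILFAST" reproduces the stop-on-first-error behaviour. Path is hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("tolerant-read").getOrCreate()

val df = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")   // continue loading despite malformed rows
  .csv("/path/to/input.txt")
```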

Re: How can I remove the need for calling cache

2017-08-02 Thread jeff saremi
Thanks Vadim, yes this is a good option for us. From: Vadim Semenov <vadim.seme...@datadoghq.com> Sent: Wednesday, August 2, 2017 6:24:40 PM To: Suzen, Mehmet Cc: jeff saremi; user@spark.apache.org Subject: Re: How can I remove the need for calling

Re: How can I remove the need for calling cache

2017-08-02 Thread jeff saremi
was hoping for From: Vadim Semenov <vadim.seme...@datadoghq.com> Sent: Tuesday, August 1, 2017 12:05:17 PM To: jeff saremi Cc: user@spark.apache.org Subject: Re: How can I remove the need for calling cache You can use `.checkpoint()`: ``` val sc: SparkContext sc.setCheckpoint

Re: How can I remove the need for calling cache

2017-08-01 Thread jeff saremi
hen duplication is already minimized even without an explicit cache call. On Tue, Aug 1, 2017 at 11:05 AM, jeff saremi <jeffsar...@hotmail.com<mailto:jeffsar...@hotmail.com>> wrote: Calling cache/persist fails all our jobs (i have posted 2 threads on this). And we're giving up hope i

Re: How can I remove the need for calling cache

2017-08-01 Thread jeff saremi
Thanks Vadim. I'll try that. From: Vadim Semenov <vadim.seme...@datadoghq.com> Sent: Tuesday, August 1, 2017 12:05:17 PM To: jeff saremi Cc: user@spark.apache.org Subject: Re: How can I remove the need for calling cache You can use `.checkpoint()`: ```

Re: How can I remove the need for calling cache

2017-08-01 Thread jeff saremi
effect as in my sample code without the use of cache(). If I use myrdd.count() would that be a good alternative? thanks From: lucas.g...@gmail.com <lucas.g...@gmail.com> Sent: Tuesday, August 1, 2017 11:23:04 AM To: jeff saremi Cc: user@spark.apache.org S

How can I remove the need for calling cache

2017-08-01 Thread jeff saremi
Calling cache/persist fails all our jobs (I have posted 2 threads on this), and we're giving up hope of finding a solution. So I'd like to find a workaround: if I save an RDD to HDFS and read it back, can I use it in more than one operation? Example: (using cache) // do a whole bunch
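A minimal sketch of the two workarounds discussed in this thread: the checkpoint() route suggested by Vadim, and the save-to-HDFS-and-read-back approach described in the question. The paths and transformations are hypothetical.

```
// Minimal sketch; paths and transformations are hypothetical.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("no-cache"))

// Option 1: checkpoint (Vadim's suggestion). The lineage is cut and the data is
// materialized under the checkpoint directory on the first action.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")
val rdd = sc.textFile("hdfs:///data/input").map(_.toUpperCase)
rdd.checkpoint()
rdd.count()   // forces the checkpoint to be written

// Option 2: save to HDFS and read it back; the re-read RDD can feed any number of
// downstream operations without cache()/persist().
rdd.saveAsTextFile("hdfs:///tmp/intermediate")
val reread = sc.textFile("hdfs:///tmp/intermediate")
val a = reread.filter(_.startsWith("A")).count()
val b = reread.filter(_.startsWith("B")).count()
```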

Re: Job keeps aborting because of org.apache.spark.shuffle.FetchFailedException: Failed to connect to server/ip:39232

2017-07-28 Thread jeff saremi
Asking this on a tangent: is there any way for the shuffle data to be replicated to more than one server? Thanks From: jeff saremi <jeffsar...@hotmail.com> Sent: Friday, July 28, 2017 4:38:08 PM To: Juan Rodríguez Hortalá Cc: user@spark.apache.org Subje

Re: Job keeps aborting because of org.apache.spark.shuffle.FetchFailedException: Failed to connect to server/ip:39232

2017-07-28 Thread jeff saremi
network.timeout=1000s ^ From: Juan Rodríguez Hortalá <juan.rodriguez.hort...@gmail.com> Sent: Friday, July 28, 2017 4:20:40 PM To: jeff saremi Cc: user@spark.apache.org Subject: Re: Job keeps aborting because of org.apache.spark.shuffle.FetchFailedException: Faile

Re: How to configure spark on Yarn cluster

2017-07-28 Thread jeff saremi
Thank you. Yes, this is the same problem; however, it looks like no one has come up with a solution for it yet. From: yohann jardin <yohannjar...@hotmail.com> Sent: Friday, July 28, 2017 10:47:40 AM To: jeff saremi; user@spark.apache.org Subject: R

Job keeps aborting because of org.apache.spark.shuffle.FetchFailedException: Failed to connect to server/ip:39232

2017-07-28 Thread jeff saremi
We have a not-too-complex and not-too-large Spark job that keeps dying with this error. I have researched it and have not seen any convincing explanation of why. I am not using a shuffle service. Which server is the one that is refusing the connection? If I go to the server that is being
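A minimal sketch of the configuration knobs commonly tried for shuffle fetch failures; the values are illustrative assumptions, not settings confirmed by this thread to fix the job.

```
// Minimal sketch; values are illustrative, not a confirmed fix.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.network.timeout", "1000s")         // longer fetch/heartbeat timeout
  .set("spark.shuffle.io.maxRetries", "10")      // retry fetches from a busy or GC-ing executor
  .set("spark.shuffle.io.retryWait", "30s")
  .set("spark.shuffle.service.enabled", "true")  // external shuffle service keeps shuffle files
                                                 // reachable even if the executor that wrote
                                                 // them dies
```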

Re: How to configure spark on Yarn cluster

2017-07-28 Thread jeff saremi
From: yohann jardin <yohannjar...@hotmail.com> Sent: Thursday, July 27, 2017 11:15:39 PM To: jeff saremi; user@spark.apache.org Subject: Re: How to configure spark on Yarn cluster Check the executor page of the Spark UI, to check if your storage level is limiting. Also, i

How to configure spark on Yarn cluster

2017-07-28 Thread jeff saremi
I have the simplest job, which I'm running against 100TB of data. The job keeps failing with ExecutorLostFailures on containers killed by Yarn for exceeding memory limits. I have varied the executor-memory from 32GB to 96GB, the spark.yarn.executor.memoryOverhead from 8192 to 36000, and similar
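A minimal sketch of where the knobs quoted in the post live; they are normally passed to spark-submit via --conf, and the values below are just the ones mentioned in the question, not a recommendation.

```
// Minimal sketch; the values are the ones quoted in the question, not a recommendation.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "32g")                 // varied 32g..96g in the post
  .set("spark.yarn.executor.memoryOverhead", "8192")   // MB of off-heap headroom per container;
                                                       // varied 8192..36000 in the post
  .set("spark.executor.cores", "2")                    // fewer concurrent tasks per executor
                                                       // also lowers peak memory per container
```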

Re: How to list only errors for a stage

2017-07-25 Thread jeff saremi
Thank you, that helps. From: 周康 <zhoukang199...@gmail.com> Sent: Monday, July 24, 2017 8:04:51 PM To: jeff saremi Cc: user@spark.apache.org Subject: Re: How to list only errors for a stage Maybe you can click the Status column header of the Task section, then faile

How to list only errors for a stage

2017-07-24 Thread jeff saremi
On the Spark status UI you can click Stages in the menu and see active and completed stages. For the active stage, you can see Succeeded/Total and a count of failed ones in parentheses. I'm looking for a way to go straight to the failed tasks and list the errors. Currently I must go into

Re: Is there "EXCEPT ALL" in Spark SQL?

2017-07-06 Thread jeff saremi
EXCEPT is not the same as EXCEPT ALL. Had they implemented EXCEPT ALL in SparkSQL, one could have easily obtained EXCEPT by adding a distinct() to the results. From: hareesh makam <makamhare...@gmail.com> Sent: Thursday, July 6, 2017 12:48:18 PM To: jeff sar

Is there "EXCEPT ALL" in Spark SQL?

2017-07-06 Thread jeff saremi
I tried this query in 1.6 and it failed: SELECT * FROM Table1 EXCEPT ALL SELECT * FROM Table2 Exception in thread "main" java.lang.RuntimeException: [1.32] failure: ``('' expected but `all' found. Thanks, Jeff
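A hedged sketch of two routes, assuming the DataFrame API is acceptable: Dataset.exceptAll exists from Spark 2.4 onward, and on older versions one workaround numbers duplicate rows so that the distinct semantics of plain EXCEPT become harmless. The table names come from the query above; the helper column name is hypothetical, and on 1.6 the window function needs a HiveContext.

```
// Hedged sketch. exceptAll is available from Spark 2.4; the row_number workaround is
// one way to emulate bag semantics on older versions. "_rn" is a hypothetical helper column.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val spark = SparkSession.builder().appName("except-all").getOrCreate()
val t1 = spark.table("Table1")
val t2 = spark.table("Table2")

// Spark 2.4+:
// val diff = t1.exceptAll(t2)

// Older versions: tag the i-th copy of each duplicate row, EXCEPT on the tagged rows,
// then drop the tag. Three copies in Table1 minus one copy in Table2 leaves two copies.
val cols = t1.columns.map(col)
val w = Window.partitionBy(cols: _*).orderBy(cols: _*)
val diff = t1.withColumn("_rn", row_number().over(w))
  .except(t2.withColumn("_rn", row_number().over(w)))
  .drop("_rn")
```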

Re: What is the equivalent of mapPartitions in SparkSQL?

2017-06-28 Thread jeff saremi
very annoying, forcing us to stay conservative and just make do without SQL. I'm sure we're not alone here. From: Aaron Perrin <aper...@gravyanalytics.com> Sent: Tuesday, June 27, 2017 4:50:25 PM To: Ryan; jeff saremi Cc: user@spark.apache.org Subject: Re: What

Re: What is the equivalent of mapPartitions in SparkSQL?

2017-06-25 Thread jeff saremi
From: Ryan <ryan.hd@gmail.com> Sent: Sunday, June 25, 2017 7:18:32 PM To: jeff saremi Cc: user@spark.apache.org Subject: Re: What is the equivalent of mapPartitions in SparkSQL? Why would you like to do so? I think there's no need for us to explicitly ask for a forEachPartition in spa

What is the equivalent of mapPartitions in SparkSQL?

2017-06-25 Thread jeff saremi
You can do a map() using a select and functions/UDFs. But how do you process a partition using SQL?
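A minimal sketch of the usual answer, assuming Spark 2.x: SQL itself has no per-partition operator, so one drops from the DataFrame to the typed Dataset (or RDD) layer, calls mapPartitions there, and registers the result back as a view. The case class and table name are hypothetical.

```
// Minimal sketch, assuming Spark 2.x. The case class and table name are hypothetical.
import org.apache.spark.sql.SparkSession

case class Record(id: Long, value: String)

val spark = SparkSession.builder().appName("map-partitions").getOrCreate()
import spark.implicits._

val processed = spark.table("my_table").as[Record]
  .mapPartitions { it =>
    // per-partition setup (open a connection, load a model, ...) goes here
    it.map(r => r.copy(value = r.value.trim))
  }

processed.createOrReplaceTempView("processed")       // back to SQL
spark.sql("SELECT COUNT(*) FROM processed").show()
```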

Re: Bizarre diff in behavior between scala REPL and sparkSQL UDF

2017-06-20 Thread jeff saremi
Never mind! I had a space at the end of my data which was not showing up in manual testing. Thanks. From: jeff saremi <jeffsar...@hotmail.com> Sent: Tuesday, June 20, 2017 2:48:06 PM To: user@spark.apache.org Subject: Bizarre diff in behavior between scal

Bizarre diff in behavior between scala REPL and sparkSQL UDF

2017-06-20 Thread jeff saremi
I have this function which does regex matching in Scala. When I test it in the REPL I get the expected results; when I use it as a UDF in SparkSQL I get completely incorrect results. Function: class UrlFilter (filters: Seq[String]) extends Serializable { val regexFilters = filters.map(new Regex(_))
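The follow-up above traces the mismatch to a trailing space in the data. A minimal sketch in the spirit of the snippet; everything past the line quoted in the post (the matches method, the trim, the UDF registration) is an assumption added to illustrate that failure mode.

```
// Minimal sketch; everything beyond the quoted first lines is an assumption.
import scala.util.matching.Regex

class UrlFilter(filters: Seq[String]) extends Serializable {
  val regexFilters: Seq[Regex] = filters.map(new Regex(_))

  // Trim before matching: a trailing space (invisible when testing by hand in the REPL)
  // is enough to make an anchored pattern stop matching.
  def matches(url: String): Boolean = {
    val clean = url.trim
    regexFilters.exists(_.findFirstIn(clean).isDefined)
  }
}

// Hypothetical UDF registration:
// val filter = new UrlFilter(Seq("^https?://example\\.com/.*$"))
// spark.udf.register("urlMatches", (u: String) => filter.matches(u))
```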

Re: Spark-submit: where do --files go?

2017-01-19 Thread jeff saremi
Thanks Sidney From: Sidney Feiner <sidney.fei...@startapp.com> Sent: Thursday, January 19, 2017 9:52 AM To: jeff saremi Cc: user@spark.apache.org Subject: Re: Spark-submit: where do --files go? Every executor creates a directory with your submitted

Re: Spark-submit: where do --files go?

2017-01-19 Thread jeff saremi
I wish someone added this to the documentation. From: jeff saremi <jeffsar...@hotmail.com> Sent: Thursday, January 19, 2017 9:56 AM To: Sidney Feiner Cc: user@spark.apache.org Subject: Re: Spark-submit: where do --files go? Thanks

Spark-submit: where do --files go?

2017-01-19 Thread jeff saremi
I'd like to know how, from within Java/Spark, I can access the dependent files that I deploy using the "--files" option on the command line.
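A minimal sketch of the usual answer, shown in Scala rather than Java: files shipped with --files are copied into each executor's working directory, and SparkFiles.get resolves their local path by bare file name. The file name is hypothetical.

```
// Minimal sketch; the file name "lookup.txt" is hypothetical.
//   spark-submit --files /local/path/lookup.txt ...
import org.apache.spark.SparkFiles
import scala.io.Source

val localPath = SparkFiles.get("lookup.txt")   // resolves the local path of the shipped file
val lines = Source.fromFile(localPath).getLines().toList
```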

RE: Access to broadcasted variable

2016-02-20 Thread jeff saremi
From: shixi...@databricks.com To: jeffsar...@hotmail.com CC: user@spark.apache.org The broadcasted object is serialized in the driver and sent to the executors. In the executor, it will deserialize the bytes to get the broadcasted object. On Fri, Feb 19, 2016 at 5:54 AM, jeff saremi <jeff

RE: Access to broadcasted variable

2016-02-19 Thread jeff saremi
Could someone please comment on this? Thanks. From: jeffsar...@hotmail.com To: user@spark.apache.org Subject: Access to broadcasted variable Date: Thu, 18 Feb 2016 14:44:07 -0500 I'd like to know if the broadcasted object gets serialized when accessed by the executor during the execution

Access to broadcasted variable

2016-02-18 Thread jeff saremi
I'd like to know if the broadcasted object gets serialized when it is accessed by the executor during the execution of a task. I know that it gets serialized from the driver to the worker; this question is about what happens inside the worker when the executor JVMs are accessing it. Thanks, Jeff
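A minimal sketch illustrating the lifecycle the reply above describes: the value is serialized on the driver, fetched and deserialized once per executor JVM, and tasks running in that executor then share the in-memory deserialized object. The lookup data is hypothetical.

```
// Minimal sketch; the lookup data is hypothetical.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("broadcast-demo"))

val lookup = Map("a" -> 1, "b" -> 2)
val bc = sc.broadcast(lookup)                // serialized on the driver

val counts = sc.parallelize(Seq("a", "b", "c"))
  .map(k => bc.value.getOrElse(k, 0))        // bc.value is deserialized once per executor
  .collect()                                 // and then reused by subsequent tasks
```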

RE: SequenceFile and object reuse

2015-11-19 Thread jeff saremi
unnecessary overhead of creating Java objects. As you've pointed out, this is at the expense of making the code more verbose when caching. -Sandy On Fri, Nov 13, 2015 at 10:29 AM, jeff saremi <jeffsar...@hotmail.com> wrote: So we tried reading a sequencefile in Spark and realized that all our reco

SequenceFile and object reuse

2015-11-13 Thread jeff saremi
So we tried reading a SequenceFile in Spark and realized that all our records ended up becoming the same. Then one of us found this: Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD or directly passing it to an
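A minimal sketch of the usual workaround: copy the contents out of the reused Writable objects before caching or collecting. The Text key/value types and the path are assumptions.

```
// Minimal sketch; the Text/Text types and the path are assumptions.
import org.apache.hadoop.io.Text
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("seqfile-copy"))

val records = sc.sequenceFile("hdfs:///data/input.seq", classOf[Text], classOf[Text])
  .map { case (k, v) => (k.toString, v.toString) }   // materialize copies of the reused objects
  .cache()                                           // now safe to cache or collect
```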

RE: How to install a Spark Package?

2015-10-05 Thread jeff saremi
@spark.apache.org To: jeffsar...@hotmail.com Are you talking about a package which is listed on http://spark-packages.org? The package should come with installation instructions, right? On Oct 4, 2015, at 8:55 PM, jeff saremi <jeffsar...@hotmail.com> wrote: So that it is available even in o

How to install a Spark Package?

2015-10-04 Thread jeff saremi
So that it is available even in offline mode? I can't seem to be able to find any notes on that. Thanks, Jeff
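A hedged sketch of one offline route: a package from spark-packages.org is ultimately a jar, so download it (and its dependencies) once while online and point the job at the local copies. The jar names and paths are hypothetical; these properties correspond to the --jars / --packages flags of spark-submit.

```
// Hedged sketch; the jar names and paths are hypothetical.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // local jars shipped to the executors (same effect as spark-submit --jars)
  .set("spark.jars", "/opt/jars/spark-csv_2.11-1.5.0.jar,/opt/jars/commons-csv-1.1.jar")
  // or, when using --packages, resolve against a pre-populated local ivy cache
  .set("spark.jars.ivy", "/opt/ivy-cache")
```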

How to make sense of Spark log entries

2015-10-03 Thread jeff saremi
There are executor logs and driver logs. Most of them are not intuitive enough to mean anything to us. Are there any notes, documents, talks on how to decipher these logs and troubleshoot our applications' performance as a result? thanks Jeff

pyspark question: create RDD from csr_matrix

2015-09-22 Thread jeff saremi
I've tried desperately to create an RDD from a matrix I have. Every combination failed. I have a sparse matrix returned from a call to dv = DictVectorizer(); sv_tf = dv.fit_transform(tf), which is supposed to be a matrix of document terms and their frequencies. I need to convert this to