Re: Can spark handle this scenario?

2018-02-16 Thread Holden Karau
I'm not sure what you mean by it being hard to serialize complex operations. Regardless, I think the question is: do you want to parallelize this on multiple machines or just one? On Feb 17, 2018 4:20 PM, "Lian Jiang" wrote: > Thanks Ayan. RDD may support map better

Re: Can spark handle this scenario?

2018-02-16 Thread Lian Jiang
Thanks Ayan. RDD may support map better than Dataset/DataFrame. However, it could be hard to serialize a complex operation for Spark to execute in parallel. IMHO, Spark does not fit this scenario. Hope this makes sense. On Fri, Feb 16, 2018 at 8:58 PM, ayan guha wrote: > **

Re: Can spark handle this scenario?

2018-02-16 Thread ayan guha
Hi, Couple of suggestions: 1. Do not use Dataset; use DataFrame in this scenario. There is no benefit from Dataset features here. Using a DataFrame, you can write an arbitrary UDF which can do what you want to do. 2. In fact you do need dataframes here. You would be better off with an RDD here. just
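A minimal sketch of the DataFrame-plus-UDF approach described in point 1 (not from the thread; the fetchPrice helper and its endpoint are placeholders for whatever call the poster needs to make):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    val spark = SparkSession.builder.appName("symbol-udf-sketch").getOrCreate()
    import spark.implicits._

    // Symbols held as a plain DataFrame rather than a typed Dataset.
    val symbols = Seq(("AAPL", "Tech"), ("XOM", "Energy")).toDF("symbol", "sector")

    // Hypothetical helper that calls an external quote API for one symbol.
    // It runs on the executors, so it (and anything it captures) must be serializable.
    def fetchPrice(symbol: String): Double = {
      // e.g. parse scala.io.Source.fromURL(s"https://quotes.example/api?s=$symbol").mkString
      0.0
    }

    val fetchPriceUdf = udf((s: String) => fetchPrice(s))
    val withPrices = symbols.withColumn("price", fetchPriceUdf($"symbol"))
    withPrices.show()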

Re: Can spark handle this scenario?

2018-02-16 Thread ayan guha
** You do NOT need dataframes, I mean. On Sat, Feb 17, 2018 at 3:58 PM, ayan guha wrote: > Hi > > Couple of suggestions: > > 1. Do not use Dataset, use Dataframe in this scenario. There is no benefit > of dataset features here. Using Dataframe, you can write an
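A rough sketch of the corrected suggestion, i.e. doing the per-symbol calls on an RDD (illustrative only; fetchTicks is a hypothetical wrapper around the HTTP call):

    import org.apache.spark.sql.SparkSession

    case class Tick(symbol: String, sector: String, price: Double)

    // Hypothetical HTTP call to the quote API; must reference only serializable state.
    def fetchTicks(symbol: String, sector: String): Seq[Tick] =
      Seq(Tick(symbol, sector, 0.0))

    val spark = SparkSession.builder.appName("symbol-rdd-sketch").getOrCreate()
    val symbolsRdd = spark.sparkContext.parallelize(Seq(("AAPL", "Tech"), ("XOM", "Energy")))

    // Each partition is processed independently on the executors,
    // so the per-symbol downloads run in parallel across the cluster.
    val ticks = symbolsRdd.flatMap { case (sym, sec) => fetchTicks(sym, sec) }
    ticks.collect().foreach(println)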

Re: Can spark handle this scenario?

2018-02-16 Thread Irving Duran
Do you only want to use Scala? Because otherwise, I think with PySpark and pandas read table you should be able to accomplish what you want. Thank you, Irving Duran On 02/16/2018 06:10 PM, Lian Jiang wrote: > Hi, > > I have a use case: > > I want to download S stock data from

Java Heap Space Error

2018-02-16 Thread Vinay Muttineni
Hello, I am trying to debug a PySpark program and, quite frankly, I am stumped. I see the following error in the logs. I verified the input parameters - all appear to be in order. Driver and executors appear to be fine - about 3MB of 7GB being used on each node. I do see that the DAG plan that

Can spark handle this scenario?

2018-02-16 Thread Lian Jiang
Hi, I have a use case: I want to download S stock data from the Yahoo API in parallel using Spark. I have all stock symbols as a Dataset. Then I used the code below to call the Yahoo API for each symbol: case class Symbol(symbol: String, sector: String) case class Tick(symbol: String, sector:
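The code in the message is truncated above; a sketch of the general shape being described might look like the following (the Tick fields past the truncation and the downloadTicks helper are guesses, not the poster's actual code):

    import org.apache.spark.sql.{Dataset, SparkSession}

    case class Symbol(symbol: String, sector: String)
    case class Tick(symbol: String, sector: String, date: String, close: Double) // fields after "sector" are guessed

    val spark = SparkSession.builder.appName("yahoo-download-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical helper that downloads the ticks for one symbol.
    def downloadTicks(s: Symbol): Seq[Tick] = Seq.empty

    val symbols: Dataset[Symbol] = Seq(Symbol("AAPL", "Tech"), Symbol("XOM", "Energy")).toDS()

    // flatMap runs on the executors, so the downloads happen in parallel,
    // provided downloadTicks and anything it captures are serializable.
    val ticks: Dataset[Tick] = symbols.flatMap(s => downloadTicks(s))
    ticks.show()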

"Too Large DataFrame" shuffle Fetch Failed exception in Spark SQL (SPARK-16753) (SPARK-9862)(SPARK-5928)(TAGs - Spark SQL, Intermediate Level, Debug)

2018-02-16 Thread Ashutosh Ranjan
Hi All, My Spark configuration is as follows:
spark = SparkSession.builder.master(mesos_ip) \
    .config('spark.executor.cores','3') \
    .config('spark.executor.memory','8g') \
    .config('spark.es.scroll.size','1') \
    .config('spark.network.timeout','600s') \
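Not part of the original message, but for readers hitting the same FetchFailed / "too large frame" errors: since shuffle blocks larger than 2 GB are a known limit (SPARK-5928), a common first mitigation is to raise the shuffle parallelism so individual blocks stay small. A sketch in Scala (the thread itself uses PySpark, but the properties are the same; values are illustrative):

    import org.apache.spark.sql.SparkSession

    // Illustrative values only; tune to the actual data volume.
    val spark = SparkSession.builder
      .appName("shuffle-tuning-sketch")
      .config("spark.sql.shuffle.partitions", "2000") // more, smaller shuffle blocks
      .config("spark.network.timeout", "600s")        // keep the original timeout bump
      .getOrCreate()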

Does the classloader used by spark blocks the I/O calls from UDF's?

2018-02-16 Thread kant kodali
Hi All, Does the class loader used by Spark block I/O calls from UDFs? If not, for security reasons wouldn't it make sense to block I/O calls within UDF code? Thanks!
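For context (not part of the original message): Spark does not, by default, sandbox UDF code, so a UDF like the sketch below can freely perform I/O on the executors; restricting that would generally require something outside Spark itself, such as a JVM SecurityManager or OS-level controls.

    import org.apache.spark.sql.functions.udf
    import java.nio.file.{Files, Paths}

    // A UDF that performs file I/O on the executor node; Spark itself does not prevent this.
    val leakyUdf = udf { (s: String) =>
      Files.write(Paths.get("/tmp/udf-io-demo.txt"), s.getBytes) // side effect on the executor
      s.length
    }
    // Usage (given a DataFrame df with a string column "value"):
    //   df.withColumn("len", leakyUdf(df("value")))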

Re: [spark-sql] Custom Query Execution listener via conf properties

2018-02-16 Thread Marcelo Vanzin
According to https://issues.apache.org/jira/browse/SPARK-19558, this feature was added in 2.3. On Fri, Feb 16, 2018 at 12:43 AM, kurian vs wrote: > Hi, > > I was trying to create a custom Query execution listener by extending the >
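For reference, the conf property that SPARK-19558 introduced is, as far as I can tell, spark.sql.queryExecutionListeners (a static conf, so it must be set before the session is created, e.g. via spark-defaults or --conf). A sketch of using it (the listener class name is a placeholder):

    import org.apache.spark.sql.SparkSession

    // Registers the listener for every session created with this config (Spark 2.3+).
    // com.example.LoggingQueryListener is a placeholder for your own implementation.
    val spark = SparkSession.builder
      .appName("listener-via-conf")
      .config("spark.sql.queryExecutionListeners", "com.example.LoggingQueryListener")
      .getOrCreate()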

[spark-sql] Custom Query Execution listener via conf properties

2018-02-16 Thread kurian vs
Hi, I was trying to create a custom query execution listener by extending the org.apache.spark.sql.util.QueryExecutionListener class. My custom listener just contains some logging statements, but I do not see those logging statements when I run a Spark job. Here are the steps I took:
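A minimal sketch of such a listener, assuming the Spark 2.x callback signatures and plain println in place of whatever logging framework the original used:

    import org.apache.spark.sql.execution.QueryExecution
    import org.apache.spark.sql.util.QueryExecutionListener

    // Logs every successful and failed query execution.
    class LoggingQueryListener extends QueryExecutionListener {
      override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
        println(s"[LoggingQueryListener] $funcName succeeded in ${durationNs / 1e6} ms")

      override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
        println(s"[LoggingQueryListener] $funcName failed: ${exception.getMessage}")
    }

    // Besides the conf property, it can also be registered programmatically:
    //   spark.listenerManager.register(new LoggingQueryListener)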