Re: A question about rdd transformation

2017-06-22 Thread Lionel Luffy
Now I've found that the root cause is a Wrapper class in AnyRef that is not Serializable, but even though I changed it to implement Serializable, the 'rows' still cannot get data... Any suggestions? On Fri, Jun 23, 2017 at 10:56 AM, Lionel Luffy wrote: > Hi there, > I'm trying to do

A question about rdd transformation

2017-06-22 Thread Lionel Luffy
Hi there, I'm trying to do the action below, but it always returns java.io.NotSerializableException in the shuffle task. I've checked that Array is serializable. How can I get the data of the rdd in newRDD? step 1: val rdd: RDD[(AnyRef, Array[AnyRef])] {..} step 2: rdd
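
A minimal Scala sketch of the failure mode being described, with a hypothetical Wrapper class standing in for the poster's non-serializable type; the key point is that everything a shuffled RDD holds must extend Serializable:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical wrapper: without "extends Serializable", any shuffle of
    // an RDD holding it fails with java.io.NotSerializableException.
    class Wrapper(val value: String) extends Serializable

    object SerializableWrapperDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("demo").setMaster("local[2]"))
        val rdd: RDD[(AnyRef, Array[AnyRef])] = sc.parallelize(Seq(
          ("k1": AnyRef, Array[AnyRef](new Wrapper("a"), new Wrapper("b")))))
        // reduceByKey shuffles, which serializes both keys and values, so
        // the wrapper (and everything it references) must be Serializable.
        val newRDD = rdd.reduceByKey(_ ++ _)
        newRDD.collect().foreach { case (k, rows) => println(s"$k -> ${rows.length}") }
        sc.stop()
      }
    }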

Re: Using Spark with Local File System/NFS

2017-06-22 Thread Michael Mior
If you put a * in the path, Spark will look for a file or directory named *. To read all the files in a directory, just remove the star. -- Michael Mior michael.m...@gmail.com On Jun 22, 2017 17:21, "saatvikshah1994" wrote: > Hi, > > I've downloaded and kept the same
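
A minimal sketch of the directory form being suggested, assuming a Spark 2.x session named spark and the path from the original post present on every node:

    // Passing the directory itself reads every file inside it; the
    // file:// scheme forces the local filesystem rather than HDFS.
    val lines = spark.sparkContext.textFile("file:///home/xyzuser/data/")
    println(lines.count())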

Using Spark with Local File System/NFS

2017-06-22 Thread saatvikshah1994
Hi, I've downloaded and kept the same set of data files on all my cluster nodes, in the same absolute path - say /home/xyzuser/data/*. I am now trying to perform an operation (say open(filename).read()) on all these files in Spark, but by passing local file paths. I was under the assumption that

Spark submit - org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run error

2017-06-22 Thread Kanagha Kumar
Hi, I am *intermittently* seeing this error while doing spark-submit with Spark 2.0.2 (Scala 2.11). I see the same issue reported in https://issues.apache.org/jira/browse/SPARK-18343 and it seems to be RESOLVED. I can run successfully most of the time, though. Hence I'm unsure if it is

Re: Exception when using ReduceByKeyAndWindow in Spark Streaming.

2017-06-22 Thread Tathagata Das
Unfortunately, I am out of ideas. I don't know what's going wrong. If you can, try using Structured Streaming. We are more active on the Structured Streaming project. On Thu, Jun 22, 2017 at 4:07 PM, swetha kasireddy wrote: > Hi TD, > > I am still seeing this issue
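
For reference, a minimal Structured Streaming sketch of a windowed distinct-visitor count, as one possible translation of the reduceByKeyAndWindow logic below; the socket source, column names, and window sizes are all assumptions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("visitors").getOrCreate()
    import spark.implicits._

    // Assumed input: one visitor id per line from a socket source.
    val visits = spark.readStream
      .format("socket").option("host", "localhost").option("port", 9999)
      .load()
      .select(current_timestamp().as("ts"), $"value".as("visitorId"))

    // A sliding window replaces the manual set-union / inverse-reduce
    // bookkeeping that reduceByKeyAndWindow requires with DStreams.
    val counts = visits
      .withWatermark("ts", "10 minutes")
      .groupBy(window($"ts", "10 minutes", "5 minutes"))
      .agg(approx_count_distinct("visitorId").as("visitors"))

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()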

Re: Exception when using ReduceByKeyAndWindow in Spark Streaming.

2017-06-22 Thread swetha kasireddy
Hi TD, I am still seeing this issue with any immutable data structure. Any idea why this happens? I use scala.collection.immutable.List[String], and my reduce and inverse reduce do the following: visitorSet1 ++ visitorSet2 and visitorSet1.filterNot(visitorSet2.contains(_)). On Wed, Jun 7, 2017
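
A sketch of the reduce / inverse-reduce pair as described above, assuming a DStream keyed by some id with List[String] visitor sets as values (all names and durations are assumptions):

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    type Visitors = List[String]

    // Forward reduce: merge the visitor sets of two batches.
    def reduceFunc(a: Visitors, b: Visitors): Visitors = a ++ b

    // Inverse reduce: subtract the batch that slid out of the window.
    def invReduceFunc(acc: Visitors, old: Visitors): Visitors =
      acc.filterNot(old.contains(_))

    // Note: the inverse-reduce form requires checkpointing to be enabled
    // on the StreamingContext (ssc.checkpoint(...)).
    def windowed(pairs: DStream[(String, Visitors)]): DStream[(String, Visitors)] =
      pairs.reduceByKeyAndWindow(reduceFunc _, invReduceFunc _,
        windowDuration = Seconds(600), slideDuration = Seconds(60))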

Re: Merging multiple Pandas dataframes

2017-06-22 Thread Saatvik Shah
Hi Assaf, Thanks for your suggestion. I also found one other improvement, which is to iteratively convert Pandas DFs to RDDs and take a union of those (similar to DataFrames). Basically, calling createDataFrame is heavy, and checkpointing of DataFrames is a brand-new feature. Instead create a huge

Re: Trouble with PySpark UDFs and SPARK_HOME only on EMR

2017-06-22 Thread Nicholas Chammas
Here’s a repro for a very similar issue where Spark hangs on the UDF, which I think is related to the SPARK_HOME issue. I posted the repro on the EMR forum, but in case you can’t access it: 1. I’m running EMR 5.6.0, Spark 2.1.1, and

Re: Flume DStream produces 0 records after HDFS node killed

2017-06-22 Thread N B
This issue got resolved. I was able to trace it to the fact that the driver program's pom.xml was pulling in Spark 2.1.1, which in turn was pulling in Hadoop 2.2.0. Explicitly adding dependencies on the Hadoop 2.7.3 libraries resolved it. The following API in HDFS:
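
The poster pinned the versions in a Maven pom.xml; as a sketch of the same idea in sbt syntax (the exact artifacts a given build needs are an assumption):

    // build.sbt: pin the Hadoop client explicitly so the transitive
    // Hadoop 2.2.0 pulled in via Spark does not win dependency resolution.
    libraryDependencies ++= Seq(
      "org.apache.spark"  %% "spark-streaming" % "2.1.1" % "provided",
      "org.apache.hadoop" %  "hadoop-client"   % "2.7.3"
    )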

Unsubscribe

2017-06-22 Thread LisTree Team
LisTree Team - Big Data Training Team - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Trouble with PySpark UDFs and SPARK_HOME only on EMR

2017-06-22 Thread Nick Chammas
I’m seeing a strange issue on EMR which I posted about here. In brief, when I try to import a UDF I’ve defined, Python somehow fails to find Spark. This exact code works for me locally and works on our on-premises CDH

How does Spark deal with Data Skewness?

2017-06-22 Thread Sea aj
Hi everyone, I have read about some interesting ideas on how to manage skew, but I was not sure whether any of these techniques are used in the Spark 2.x versions or not. To name a few, "Salting the Data" and "Dynamic Repartitioning" are techniques introduced in Spark Summits. I am really curious to
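
Salting, at least, is easy to apply by hand in any Spark version; a minimal sketch of the idea (the bucket count of 10 is an arbitrary choice):

    import scala.util.Random
    import org.apache.spark.rdd.RDD

    // Spread a hot key across N sub-keys, pre-aggregate, then strip the salt.
    def saltedCount(keys: RDD[String], buckets: Int = 10): RDD[(String, Long)] =
      keys
        .map(k => ((k, Random.nextInt(buckets)), 1L)) // salt the key
        .reduceByKey(_ + _)                           // partial aggregation per sub-key
        .map { case ((k, _), n) => (k, n) }           // drop the salt
        .reduceByKey(_ + _)                           // final merge per original key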

Re: Re: Re: spark2.1 kafka0.10

2017-06-22 Thread lk_spark
Thank you Kumar, I will try it later. 2017-06-22 lk_spark From: Pralabh Kumar Sent: 2017-06-22 20:20 Subject: Re: Re: spark2.1 kafka0.10 To: "lk_spark" Cc: "user.spark" It looks like the replicas for your partitions are failing. If

Re: Why my project has this kind of error?

2017-06-22 Thread satish lalam
Minglei - You could check your JDK path and Scala library settings in the project structure: open the project view (Alt+1), then press F4 to open Project Structure... look under SDKs and Libraries. On Mon, Jun 19, 2017 at 10:54 PM, 张明磊 wrote: > Hello to all, > > Below

Re: Re: spark2.1 kafka0.10

2017-06-22 Thread Pralabh Kumar
It looks like the replicas for your partitions are failing. If you have more brokers, can you try increasing the replicas, just to make sure at least one leader is always available? On Thu, Jun 22, 2017 at 10:34 AM, lk_spark wrote: > each topic has 5 partitions, 2 replicas. > >

Re: Broadcasts & Storage Memory

2017-06-22 Thread Pralabh Kumar
Hi, broadcast variables are definitely stored in the spark.memory.storageFraction pool. 1. If we go into the code of TorrentBroadcast.scala, the writeBlocks method navigates through BlockManager to MemoryStore. Deserialization of the variables occurs in unroll memory, and the result is then transferred to storage memory.
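
As a reference point for the path described above, the user-facing side is just sc.broadcast; a minimal sketch, assuming Spark 2.x defaults:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("bcast").master("local[2]").getOrCreate()
    val sc = spark.sparkContext

    // TorrentBroadcast.writeBlocks stores the serialized blocks via
    // BlockManager; on first access each executor unrolls the deserialized
    // value into storage memory (the pool sized by
    // spark.memory.storageFraction).
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

    val total = sc.parallelize(Seq("a", "b", "a"))
      .map(k => lookup.value.getOrElse(k, 0))
      .sum()
    println(total)
    spark.stop()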