Spark2.4 json Jackson errors

2021-04-13 Thread KhajaAsmath Mohammed
Hi,

I am having an issue when running custom
applications on Spark 2.4. I was able to
run them successfully from my Windows IDE, but I cannot run them on EMR with Spark 2.4. I get 
a "JsonMethods not found" error.

I have included json4s in the uber jar but still get this error. Any solution to 
resolve this? 
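
A common cause is a json4s version conflict with the copy of json4s that ships inside 
Spark 2.4 itself. A minimal build.sbt sketch of shading json4s in the uber jar, assuming 
sbt-assembly is used (the shaded package name is illustrative, and the json4s version is 
assumed to match Spark 2.4's):

// build.sbt (sketch) -- shade json4s so the uber jar's copy cannot clash with
// the json4s classes already bundled with Spark 2.4 on the cluster.
libraryDependencies += "org.json4s" %% "json4s-jackson" % "3.5.3"  // assumed Spark 2.4 version

assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.json4s.**" -> "shaded.json4s.@1").inAll
)

Alternatively, marking json4s as "provided" and relying on the version Spark ships avoids 
bundling a second, conflicting copy at all.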

Thanks,
Asmath
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark Session error with 30s

2021-04-13 Thread KhajaAsmath Mohammed
I was able to resolve this by changing hdfs-site.xml, as I mentioned in my 
initial thread.
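
For reference, the stack trace later in this thread points at DfsClientConf, so the failing 
value is a Hadoop DFS client time property carrying an "s" suffix that the older parser cannot 
read. A hedged sketch of overriding such a property with a plain number from the Spark side 
(the property name below is only a guess based on the trace, not confirmed by this thread; the 
actual fix used here was editing hdfs-site.xml):

import org.apache.spark.sql.SparkSession

// Sketch only: spark.hadoop.* settings are forwarded to the Hadoop Configuration,
// so the DFS client would see a plain number instead of a "30s"-style duration.
val spark = SparkSession.builder()
  .appName("dfs-client-conf-workaround")
  .config("spark.hadoop.dfs.client.datanode-restart.timeout", "30")  // hypothetical property
  .getOrCreate()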

Thanks,
Asmath

> On Apr 12, 2021, at 8:35 PM, Peng Lei  wrote:
> 
> 
> Hi KhajaAsmath Mohammed
>   Please check the configuration of "spark.speculation.interval"; just pass 
> "30" to it.
>   
>  '''
>  override def start(): Unit = {
>    backend.start()
> 
>    if (!isLocal && conf.get(SPECULATION_ENABLED)) {
>      logInfo("Starting speculative execution thread")
>      speculationScheduler.scheduleWithFixedDelay(
>        () => Utils.tryOrStopSparkContext(sc) { checkSpeculatableTasks() },
>        SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
>    }
>  }
>  '''
>   
> 
> Sean Owen wrote on Tuesday, April 13, 2021 at 3:30 AM:
>> Something is passing this invalid 30s value, yes. Hard to say which property 
>> it is. I'd check if your cluster config sets anything with the value 30s - 
>> whatever is reading this property is not expecting it. 
>> 
>>> On Mon, Apr 12, 2021, 2:25 PM KhajaAsmath Mohammed 
>>>  wrote:
>>> Hi Sean,
>>> 
>>> Do you think there is anything with the DFS client that could cause this?
>>> 
>>> java.lang.NumberFormatException: For input string: "30s"
>>> at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>> at java.lang.Long.parseLong(Long.java:589)
>>> at java.lang.Long.parseLong(Long.java:631)
>>> at org.apache.hadoop.conf.Configuration.getLong(Configuration.java:1429)
>>> at org.apache.hadoop.hdfs.client.impl.DfsClientConf.<init>(DfsClientConf.java:247)
>>> at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:301)
>>> at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:285)
>>> at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:160)
>>> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2859)
>>> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
>>> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
>>> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2878)
>>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:392)
>>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:184)
>>> at org.apache.spark.deploy.yarn.Client$$anonfun$8.apply(Client.scala:137)
>>> at org.apache.spark.deploy.yarn.Client$$anonfun$8.apply(Client.scala:137)
>>> at scala.Option.getOrElse(Option.scala:121)
>>> at org.apache.spark.deploy.yarn.Client.<init>(Client.scala:137)
>>> at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>>> at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:183)
>>> at org.apache.spark.SparkContext.<init>(SparkContext.scala:501)
>>> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
>>> at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:936)
>>> at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession
>>> 
>>> Thanks,
>>> Asmath
>>> 
 On Mon, Apr 12, 2021 at 2:20 PM KhajaAsmath Mohammed 
  wrote:
 I am using the Spark HBase connector provided by Hortonworks. I was able to 
 run without issues in my local environment but have this issue on EMR. 
 
 Thanks,
 Asmath
 
>> On Apr 12, 2021, at 2:15 PM, Sean Owen  wrote:
>> 
> 
> Somewhere you're passing a property that expects a number, but giving it 
> "30s". Is it a time property somewhere that really just wants milliseconds or 
> something? But most (all?) time properties in Spark should accept that 
> type of input anyway. It really depends on which property has a problem and 
> what is setting it.
> 
>> On Mon, Apr 12, 2021 at 1:56 PM KhajaAsmath Mohammed 
>>  wrote:
>> Hi,
>> 
>> I am getting a weird error when running a Spark job on an EMR cluster. The 
>> same program runs fine on my local machine. Is there anything I need to do 
>> to resolve this?
>> 
>> 21/04/12 18:48:45 ERROR SparkContext: Error initializing SparkContext.
>> java.lang.NumberFormatException: For input string: "30s"
>> 
>> I tried the solution mentioned in the link below but it didn't work for 
>> me.
>> 
>> https://hadooptutorials.info/2020/10/11/part-5-using-spark-as-execution-engine-for-hive-2/
>> 
>> Thanks,
>> Asmath


Re: [Spark Core][Advanced]: Problem with data locality when running Spark query with local nature on apache Hadoop

2021-04-13 Thread Russell Spitzer
scala> def getRootRdd( rdd:RDD[_] ): RDD[_]  = { if (rdd.dependencies.size == 
0) rdd else getRootRdd(rdd.dependencies(0).rdd)}
getRootRdd: (rdd: org.apache.spark.rdd.RDD[_])org.apache.spark.rdd.RDD[_]

scala> val rdd = spark.read.parquet("/Users/russellspitzer/Temp/local").rdd
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[38] 
at rdd at <console>:24

scala> val scan = getRootRdd(rdd)
scan: org.apache.spark.rdd.RDD[_] = FileScanRDD[33] at rdd at <console>:24

scala> scan.partitions.map(scan.preferredLocations)
res8: Array[Seq[String]] = Array(WrappedArray(), WrappedArray(), 
WrappedArray(), WrappedArray(), WrappedArray(), WrappedArray(), WrappedArray(), 
WrappedArray(), WrappedArray(), WrappedArray(), WrappedArray())

I define a quick traversal to get the source RDD for the DataFrame operation. I 
create the read DataFrame and get the RDD out of it, then traverse the RDD's 
dependencies to get the FileScanRDD. Finally I apply the scan's preferredLocations 
method to each partition. You can see that none of my partitions have a preferred 
location, so they will all be run at "ANY". This is because I'm using my local 
file system, which never reports a preferred location; so even though the 
scheduler reports "ANY" in this case, the tasks are actually node local.
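
A hedged follow-up sketch (not part of the original session): given the scan value 
from above, one way to compare the filesystem's preferred locations with the hosts 
the executors are registered on. If the two sets share no hostnames/IPs, every read 
task falls back to locality level ANY.

// Executor addresses are "host:port" strings; keep only the host part.
val executorHosts = spark.sparkContext.getExecutorMemoryStatus.keys
  .map(_.split(":")(0))
  .toSet

// Hostnames (or IPs) the filesystem reports for each partition of the scan.
val preferredHosts = scan.partitions
  .flatMap(scan.preferredLocations)
  .toSet

println(s"executors: $executorHosts")
println(s"preferred: $preferredHosts")
println(s"overlap:   ${executorHosts intersect preferredHosts}")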


> On Apr 13, 2021, at 8:37 AM, Mohamadreza Rostami 
>  wrote:
> 
> Thanks for your response.
> I think my HDFS-Spark cluster is co-localized, because I have a Spark worker 
> on each datanode; in other words, I installed the Spark workers on the 
> datanodes. So the question is why this simple query on a co-localized 
> HDFS-Spark cluster still runs at the "Any" locality level.
> Is there any way to figure out which IPs or hostnames of the datanodes the 
> namenode returns to Spark? Or can you offer me a debugging approach?
> 
>> On Farvardin 24, 1400 AP, at 17:45, Russell Spitzer 
>> <russell.spit...@gmail.com> wrote:
>> 
>> Data locality can only occur if the Spark Executor IP address string matches 
>> the preferred location returned by the file system. So this job would only 
>> have local tasks if the datanode replicas for the files in question had the 
>> same ip address as the Spark executors you are using. If they don't then the 
>> scheduler falls back to assigning read tasks to the first executor available 
>> with locality level "any". 
>> 
>> So unless you have that HDFS - Spark Cluster co-localization I wouldn't 
>> expect this job to run at any other locality level than ANY.
>> 
>>> On Apr 13, 2021, at 3:47 AM, Mohamadreza Rostami 
>>> <mohamadrezarosta...@gmail.com> wrote:
>>> 
>>> I have a Hadoop cluster that uses Apache Spark to query parquet files saved 
>>> on Hadoop. For example, I'm using the following PySpark code to find a 
>>> word in the parquet files:
>>> df = spark.read.parquet("hdfs://test/parquets/*")
>>> df.filter(df['word'] == "jhon").show()
>>> After running this code, I go to the Spark application UI, Stages tab, and see 
>>> that the locality level summary is set to Any. In contrast, because of this 
>>> query's nature, it should run locally, at the NODE_LOCAL locality level at 
>>> least. When I check the network IO of the cluster while running this, I 
>>> find that the query uses the network (network IO increases while the query 
>>> is running). The strange part of this situation is that the number shown in 
>>> the Spark UI's shuffle section is very small.
>>> How can I find out the root cause of this problem and solve it?
>>> Link on stackoverflow.com: 
>>> https://stackoverflow.com/questions/66612906/problem-with-data-locality-when-running-spark-query-with-local-nature-on-apache
> 



Re: [Spark Core][Advanced]: Problem with data locality when running Spark query with local nature on apache Hadoop

2021-04-13 Thread Russell Spitzer
Data locality can only occur if the Spark Executor IP address string matches 
the preferred location returned by the file system. So this job would only have 
local tasks if the datanode replicas for the files in question had the same ip 
address as the Spark executors you are using. If they don't then the scheduler 
falls back to assigning read tasks to the first executor available with 
locality level "any". 

So unless you have that HDFS - Spark Cluster co-localization I wouldn't expect 
this job to run at any other locality level than ANY.
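
A hedged debugging sketch (not from the original message): one way to see which 
datanode hostnames the namenode actually reports for a file, so they can be compared 
against the executors' addresses. The path below is illustrative.

import org.apache.hadoop.fs.{FileSystem, Path}

// List the hosts HDFS reports for each block of the file; if these names/IPs
// never match the Spark executors' addresses, every read task is scheduled at ANY.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val status = fs.getFileStatus(new Path("hdfs://test/parquets/part-00000.parquet"))
fs.getFileBlockLocations(status, 0, status.getLen)
  .foreach(block => println(block.getHosts.mkString(", ")))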

> On Apr 13, 2021, at 3:47 AM, Mohamadreza Rostami 
>  wrote:
> 
> I have a Hadoop cluster that uses Apache Spark to query parquet files saved 
> on Hadoop. For example, I'm using the following PySpark code to find a 
> word in the parquet files:
> df = spark.read.parquet("hdfs://test/parquets/*")
> df.filter(df['word'] == "jhon").show()
> After running this code, I go to the Spark application UI, Stages tab, and see that 
> the locality level summary is set to Any. In contrast, because of this query's 
> nature, it should run locally, at the NODE_LOCAL locality level at least. When I 
> check the network IO of the cluster while running this, I find that the query 
> uses the network (network IO increases while the query is running). The 
> strange part of this situation is that the number shown in the Spark UI's 
> shuffle section is very small.
> How can I find out the root cause of this problem and solve it?
> Link on stackoverflow.com: 
> https://stackoverflow.com/questions/66612906/problem-with-data-locality-when-running-spark-query-with-local-nature-on-apache


[Spark Core][Advanced]: Problem with data locality when running Spark query with local nature on apache Hadoop

2021-04-13 Thread Mohamadreza Rostami
I have a Hadoop cluster that uses Apache Spark to query parquet files saved on 
Hadoop. For example, I'm using the following PySpark code to find a word 
in the parquet files:
df = spark.read.parquet("hdfs://test/parquets/*")
df.filter(df['word'] == "jhon").show()
After running this code, I go to the Spark application UI, Stages tab, and see that 
the locality level summary is set to Any. In contrast, because of this query's nature, 
it should run locally, at the NODE_LOCAL locality level at least. When I check the 
network IO of the cluster while running this, I find that the query uses the 
network (network IO increases while the query is running). The strange part of 
this situation is that the number shown in the Spark UI's shuffle section is 
very small.
How can I find out the root cause of this problem and solve it?
Link on stackoverflow.com: 
https://stackoverflow.com/questions/66612906/problem-with-data-locality-when-running-spark-query-with-local-nature-on-apache