unsubscribe

2023-09-13 Thread randy clinton
unsubscribe

-- 
I appreciate your time,

~Randy




unsubscribe

2022-07-14 Thread randy clinton
-- 
I appreciate your time,

~Randy


Re: Hey good looking toPandas () error stack

2020-06-21 Thread randy clinton
You can see from the GitHub history for "toPandas()" that the function has
been in the code for 5 years.
https://github.com/apache/spark/blame/a075cd5b700f88ef447b559c6411518136558d78/python/pyspark/sql/dataframe.py#L923

When I google IllegalArgumentException: 'Unsupported class file major
version 55'

I see posts about the Java version being used. Class file major version 55
is Java 11, and Spark 2.4.x only runs on Java 8, which would explain why the
same code works on your 3.0.0 install. Are you sure your configs are right?
https://stackoverflow.com/questions/53583199/pyspark-error-unsupported-class-file-major-version
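
If that is the problem, one low-risk fix is to point the 2.4.6 install at a
Java 8 JVM before the session starts. A minimal sketch, assuming a typical
OpenJDK 8 location (the JAVA_HOME value below is a placeholder, not taken
from this thread):

# Set JAVA_HOME before any Spark code runs, otherwise the JVM gateway
# will already have launched on the wrong Java.
import os
import findspark

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"  # assumed path
findspark.init("/home/spark-2.4.6-bin-hadoop2.7")

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("Titanic Data")
         .getOrCreate())
print(spark.version)  # should report 2.4.6, now running on Java 8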

On Sat, Jun 20, 2020 at 6:17 AM Anwar AliKhan 
wrote:

>
> Two versions of Spark running against the same code
>
>
> https://towardsdatascience.com/your-first-apache-spark-ml-model-d2bb82b599dd
>
> version spark-2.4.6-bin-hadoop2.7 is producing an error for toPandas(). See
> the error stack below.
>
> Jupyter Notebook
>
> import findspark
>
> findspark.init('/home/spark-3.0.0-bin-hadoop2.7')
>
> cell "spark"
>
> cell output
>
> SparkSession - in-memory
> SparkContext (Spark UI)
> Version: v3.0.0
> Master: local[*]
> AppName: Titanic Data
>
> import findspark
>
> findspark.init('/home/spark-2.4.6-bin-hadoop2.7')
>
> cell  "spark"
>
>
>
> cell output
>
> SparkSession - in-memory
> SparkContext (Spark UI)
> Version: v2.4.6
> Master: local[*]
> AppName: Titanic Data
>
> cell "df.show(5)"
>
>
> +-----------+--------+------+--------------------+------+---+-----+-----+----------------+-------+-----+--------+
> |PassengerId|Survived|Pclass|                Name|   Sex|Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
> +-----------+--------+------+--------------------+------+---+-----+-----+----------------+-------+-----+--------+
> |          1|       0|     3|Braund, Mr. Owen ...|  male| 22|    1|    0|       A/5 21171|   7.25| null|       S|
> |          2|       1|     1|Cumings, Mrs. Joh...|female| 38|    1|    0|        PC 17599|71.2833|  C85|       C|
> |          3|       1|     3|Heikkinen, Miss. ...|female| 26|    0|    0|STON/O2. 3101282|  7.925| null|       S|
> |          4|       1|     1|Futrelle, Mrs. Ja...|female| 35|    1|    0|          113803|   53.1| C123|       S|
> |          5|       0|     3|Allen, Mr. Willia...|  male| 35|    0|    0|          373450|   8.05| null|       S|
> +-----------+--------+------+--------------------+------+---+-----+-----+----------------+-------+-----+--------+
>
> only showing top 5 rows
>
> cell "df.toPandas()"
>
> cell output
>
> ---
>
> Py4JJavaError Traceback (most recent call last)
>
> /home/spark-2.4.6-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a,
> **kw)
>
>  62 try:
>
> ---> 63 return f(*a, **kw)
>
>  64 except py4j.protocol.Py4JJavaError as e:
>
> /home/spark-2.4.6-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py
> in get_return_value(answer, gateway_client, target_id, name)
>
> 327 "An error occurred while calling {0}{1}{2}.\n".
>
> --> 328 format(target_id, ".", name), value)
>
> 329 else:
>
> Py4JJavaError: An error occurred while calling o33.collectToPython.
>
> : java.lang.IllegalArgumentException: Unsupported class file major version
> 55
>
> at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:166)
>
> at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:148)
>
> at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:136)
>
> at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:237)
>
> at
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:50)
>
> at
> org.apache.spark.util.FieldAccessFinder$$anon$4$$anonfun$visitMethodInsn$7.apply(ClosureCleaner.scala:845)
>
> at
> org.apache.spark.util.FieldAccessFinder$$anon$4$$anonfun$visitMethodInsn$7.apply(ClosureCleaner.scala:828)
>
> at
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>
> at
> scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
>
> at
> scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
>
> at
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
>
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>
> at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134)
>
> at
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>
> at
> org.apache.spark.util.FieldAccessFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:828)
>
> at org.apache.xbean.asm6.ClassReader.readCode(ClassReader.java:2175)
>
> at org.apache.xbean.asm6.ClassReader.readMethod(ClassReader.java:1238)
>
> at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:631)
>
> at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:355)
>
> at
> 

Re: Spark dataframe hdfs vs s3

2020-05-29 Thread randy clinton
HDFS is simply a better place to make performant reads, and on top of that
the data is closer to your Spark job. The Databricks link from above will
show you that they found a 6x read throughput difference between the two.

If your HDFS is part of the same Spark cluster, then it should be an
incredibly fast read vs reaching out to S3 for the data.

They are different types of storage solving different things.

Something I have seen in workflows, and which others have suggested above,
is a stage where you load data from S3 into HDFS, then move on to your other
work with it, and maybe finally persist the results outside of HDFS.
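
A rough PySpark sketch of that staging pattern (bucket names, paths, and the
column used in the aggregation are placeholders, not from this thread):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-to-hdfs-staging").getOrCreate()

# 1. Pull the raw data once from S3 (s3a:// is the connector to prefer).
raw = spark.read.parquet("s3a://my-bucket/raw/events/")

# 2. Stage it on the cluster's HDFS so later stages get local reads.
raw.write.mode("overwrite").parquet("hdfs:///staging/events/")

# 3. Do the real work against the HDFS copy.
events = spark.read.parquet("hdfs:///staging/events/")
result = events.groupBy("event_type").count()

# 4. Optionally persist the (much smaller) result back outside HDFS.
result.write.mode("overwrite").parquet("s3a://my-bucket/output/event_counts/")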

On Fri, May 29, 2020 at 2:09 PM Bin Fan  wrote:

> Try to deploy Alluxio as a caching layer on top of S3, providing Spark a
> similar HDFS interface?
> Like in this article:
>
> https://www.alluxio.io/blog/accelerate-spark-and-hive-jobs-on-aws-s3-by-10x-with-alluxio-tiered-storage/
>
>
> On Wed, May 27, 2020 at 6:52 PM Dark Crusader <
> relinquisheddra...@gmail.com> wrote:
>
>> Hi Randy,
>>
>> Yes, I'm using parquet on both S3 and hdfs.
>>
>> On Thu, 28 May, 2020, 2:38 am randy clinton, 
>> wrote:
>>
>>> Is the file Parquet on S3 or is it some other file format?
>>>
>>> In general I would assume that HDFS read/writes are more performant for
>>> spark jobs.
>>>
>>> For instance, consider how well partitioned your HDFS file is vs the S3
>>> file.
>>>
>>> On Wed, May 27, 2020 at 1:51 PM Dark Crusader <
>>> relinquisheddra...@gmail.com> wrote:
>>>
>>>> Hi Jörn,
>>>>
> >>>> Thanks for the reply. I will try to create an easier example to
>>>> reproduce the issue.
>>>>
>>>> I will also try your suggestion to look into the UI. Can you guide on
>>>> what I should be looking for?
>>>>
>>>> I was already using the s3a protocol to compare the times.
>>>>
>>>> My hunch is that multiple reads from S3 are required because of
>>>> improper caching of intermediate data. And maybe hdfs is doing a better job
>>>> at this. Does this make sense?
>>>>
>>>> I would also like to add that we built an extra layer on S3 which might
>>>> be adding to even slower times.
>>>>
>>>> Thanks for your help.
>>>>
>>>> On Wed, 27 May, 2020, 11:03 pm Jörn Franke, 
>>>> wrote:
>>>>
>>>>> Have you looked in Spark UI why this is the case ?
>>>>> S3 Reading can take more time - it depends also what s3 url you are
>>>>> using : s3a vs s3n vs S3.
>>>>>
>>>>> It could help after some calculation to persist in-memory or on HDFS.
>>>>> You can also initially load from S3 and store on HDFS and work from there 
>>>>> .
>>>>>
>>>>> HDFS offers Data locality for the tasks, ie the tasks start on the
>>>>> nodes where the data is. Depending on what s3 „protocol“ you are using you
>>>>> might be also more punished with performance.
>>>>>
>>>>> Try s3a as a protocol (replace all s3n with s3a).
>>>>>
>>>>> You can also use s3 url but this requires a special bucket
> >>>>> configuration, a dedicated empty bucket and it lacks some interoperability
>>>>> with other AWS services.
>>>>>
>>>>> Nevertheless, it could be also something else with the code. Can you
>>>>> post an example reproducing the issue?
>>>>>
> >>>>> > On 27.05.2020 at 18:18, Dark Crusader <
> >>>>> relinquisheddra...@gmail.com> wrote:
>>>>> >
>>>>> > 
>>>>> > Hi all,
>>>>> >
>>>>> > I am reading data from hdfs in the form of parquet files (around 3
>>>>> GB) and running an algorithm from the spark ml library.
>>>>> >
>>>>> > If I create the same spark dataframe by reading data from S3, the
>>>>> same algorithm takes considerably more time.
>>>>> >
> >>>>> > I don't understand why this is happening. Is this a chance occurrence
> >>>>> or are the spark dataframes created differently?
>>>>> >
> >>>>> > I don't understand how the data store would affect the algorithm
>>>>> performance.
>>>>> >
>>>>> > Any help would be appreciated. Thanks a lot.
>>>>>
>>>>
>>>
>>> --
>>> I appreciate your time,
>>>
>>> ~Randy
>>>
>>

-- 
I appreciate your time,

~Randy


Re: Spark dataframe hdfs vs s3

2020-05-28 Thread randy clinton
See if this helps:

"That is to say, on a per node basis, HDFS can yield 6X higher read
throughput than S3. Thus, given that the S3 is 10x cheaper than HDFS, we
find that S3 is almost 2x better compared to HDFS on performance per
dollar."

https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html

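For what it's worth, the "almost 2x" figure follows from the two numbers in
that quote; a quick back-of-the-envelope check (my arithmetic, not from the
blog post):

# S3 per-node read throughput is ~1/6 of HDFS, but it costs ~1/10 as much.
hdfs_throughput, s3_throughput = 6.0, 1.0
hdfs_cost, s3_cost = 10.0, 1.0

s3_vs_hdfs_per_dollar = (s3_throughput / s3_cost) / (hdfs_throughput / hdfs_cost)
print(s3_vs_hdfs_per_dollar)  # ~1.67, i.e. "almost 2x" better performance per dollar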

On Wed, May 27, 2020, 9:51 PM Dark Crusader 
wrote:

> Hi Randy,
>
> Yes, I'm using parquet on both S3 and hdfs.
>
> On Thu, 28 May, 2020, 2:38 am randy clinton, 
> wrote:
>
>> Is the file Parquet on S3 or is it some other file format?
>>
>> In general I would assume that HDFS read/writes are more performant for
>> spark jobs.
>>
>> For instance, consider how well partitioned your HDFS file is vs the S3
>> file.
>>
>> On Wed, May 27, 2020 at 1:51 PM Dark Crusader <
>> relinquisheddra...@gmail.com> wrote:
>>
>>> Hi Jörn,
>>>
> >>> Thanks for the reply. I will try to create an easier example to reproduce
>>> the issue.
>>>
>>> I will also try your suggestion to look into the UI. Can you guide on
>>> what I should be looking for?
>>>
>>> I was already using the s3a protocol to compare the times.
>>>
>>> My hunch is that multiple reads from S3 are required because of improper
>>> caching of intermediate data. And maybe hdfs is doing a better job at this.
>>> Does this make sense?
>>>
>>> I would also like to add that we built an extra layer on S3 which might
>>> be adding to even slower times.
>>>
>>> Thanks for your help.
>>>
>>> On Wed, 27 May, 2020, 11:03 pm Jörn Franke, 
>>> wrote:
>>>
>>>> Have you looked in Spark UI why this is the case ?
>>>> S3 Reading can take more time - it depends also what s3 url you are
>>>> using : s3a vs s3n vs S3.
>>>>
>>>> It could help after some calculation to persist in-memory or on HDFS.
>>>> You can also initially load from S3 and store on HDFS and work from there .
>>>>
>>>> HDFS offers Data locality for the tasks, ie the tasks start on the
>>>> nodes where the data is. Depending on what s3 „protocol“ you are using you
>>>> might be also more punished with performance.
>>>>
>>>> Try s3a as a protocol (replace all s3n with s3a).
>>>>
>>>> You can also use s3 url but this requires a special bucket
> >>>> configuration, a dedicated empty bucket and it lacks some interoperability
>>>> with other AWS services.
>>>>
>>>> Nevertheless, it could be also something else with the code. Can you
>>>> post an example reproducing the issue?
>>>>
> >>>> > On 27.05.2020 at 18:18, Dark Crusader <
> >>>> relinquisheddra...@gmail.com> wrote:
>>>> >
>>>> > 
>>>> > Hi all,
>>>> >
>>>> > I am reading data from hdfs in the form of parquet files (around 3
>>>> GB) and running an algorithm from the spark ml library.
>>>> >
>>>> > If I create the same spark dataframe by reading data from S3, the
>>>> same algorithm takes considerably more time.
>>>> >
> >>>> > I don't understand why this is happening. Is this a chance occurrence
> >>>> or are the spark dataframes created differently?
>>>> >
> >>>> > I don't understand how the data store would affect the algorithm
>>>> performance.
>>>> >
>>>> > Any help would be appreciated. Thanks a lot.
>>>>
>>>
>>
>> --
>> I appreciate your time,
>>
>> ~Randy
>>
>


Re: Spark dataframe hdfs vs s3

2020-05-27 Thread randy clinton
Is the file Parquet on S3 or is it some other file format?

In general I would assume that HDFS read/writes are more performant for
spark jobs.

For instance, consider how well partitioned your HDFS file is vs the S3
file.
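
One quick way to compare the two sources is to read each and look at what
Spark actually sees; a rough sketch (paths are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-vs-s3-check").getOrCreate()

hdfs_df = spark.read.parquet("hdfs:///data/events/")   # placeholder path
s3_df = spark.read.parquet("s3a://my-bucket/events/")  # placeholder path

# Very different input partition counts often explain slow downstream
# stages far more than the storage layer itself.
print("HDFS partitions:", hdfs_df.rdd.getNumPartitions())
print("S3 partitions:  ", s3_df.rdd.getNumPartitions())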

On Wed, May 27, 2020 at 1:51 PM Dark Crusader 
wrote:

> Hi Jörn,
>
> Thanks for the reply. I will try to create an easier example to reproduce
> the issue.
>
> I will also try your suggestion to look into the UI. Can you guide on what
> I should be looking for?
>
> I was already using the s3a protocol to compare the times.
>
> My hunch is that multiple reads from S3 are required because of improper
> caching of intermediate data. And maybe hdfs is doing a better job at this.
> Does this make sense?
>
> I would also like to add that we built an extra layer on S3 which might be
> adding to even slower times.
>
> Thanks for your help.
>
> On Wed, 27 May, 2020, 11:03 pm Jörn Franke,  wrote:
>
>> Have you looked in Spark UI why this is the case ?
>> S3 Reading can take more time - it depends also what s3 url you are using
>> : s3a vs s3n vs S3.
>>
>> It could help after some calculation to persist in-memory or on HDFS. You
>> can also initially load from S3 and store on HDFS and work from there .
>>
>> HDFS offers Data locality for the tasks, ie the tasks start on the nodes
>> where the data is. Depending on what s3 „protocol“ you are using you might
>> be also more punished with performance.
>>
>> Try s3a as a protocol (replace all s3n with s3a).
>>
>> You can also use s3 url but this requires a special bucket configuration,
>> a dedicated empty bucket and it lacks some interoperability with other AWS
>> services.
>>
>> Nevertheless, it could be also something else with the code. Can you post
>> an example reproducing the issue?
>>
>> > On 27.05.2020 at 18:18, Dark Crusader <
>> relinquisheddra...@gmail.com> wrote:
>> >
>> > 
>> > Hi all,
>> >
>> > I am reading data from hdfs in the form of parquet files (around 3 GB)
>> and running an algorithm from the spark ml library.
>> >
>> > If I create the same spark dataframe by reading data from S3, the same
>> algorithm takes considerably more time.
>> >
>> > I don't understand why this is happening. Is this a chance occurrence or
>> are the spark dataframes created differently?
>> >
>> > I don't understand how the data store would affect the algorithm
>> performance.
>> >
>> > Any help would be appreciated. Thanks a lot.
>>
>

-- 
I appreciate your time,

~Randy


Re: Left Join at SQL query gets planned as inner join

2020-04-30 Thread randy clinton
Does it still plan an inner join if you remove the filters on both tables?

It seems like you are asking for a left join, but your WHERE filters on the
right-hand table reject the NULL rows a left join would produce, so they
effectively demand the behavior of an inner join.

Maybe you could do the filters on the tables first and then join them.

Something roughly like..

s_DF = s_DF.filter(year = 2020 and month = 4 and day = 29)
p_DF = p_DF.filter(year = 2020 and month = 4 and day = 29 and event_id is
null)

output = s_DF.join(p_DF, event_id == source_event_id, left)
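
In runnable PySpark that might look roughly like this (table and column
names come from the query in the thread, the rest is a sketch):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Assuming s and p are registered tables, as in the original query.
s_df = spark.table("s")
p_df = spark.table("p")

# Apply the partition filters (and the null check on p.event_id) first...
s_day = s_df.filter(
    (F.col("year") == 2020) & (F.col("month") == 4) & (F.col("day") == 29)
)
p_day = p_df.filter(
    (F.col("year") == 2020) & (F.col("month") == 4) & (F.col("day") == 29)
    & F.col("event_id").isNull()
)

# ...then join. With no WHERE clause on p after the join, the optimizer has
# no reason to rewrite the left join as an inner join.
output = s_day.join(
    p_day, s_day["event_id"] == p_day["source_event_id"], "left_outer"
)
output.explain()  # the plan should still show a LeftOuter join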



On Thu, Apr 30, 2020 at 11:06 AM Roland Johann
 wrote:

> Hi All,
>
>
> we are on vanilla Spark 2.4.4 and currently experience a somewhat strange
> behavior of the query planner/optimizer and therefore get wrong results.
>
> select
> s.event_id as search_event_id,
> s.query_string,
> p.event_id
> from s
> left outer join p on s.event_id = p.source_event_id
> where
> s.year = 2020 and s.month = 4 and s.day = 29
> and p.year = 2020 and p.month = 4 and p.day = 29
> limit 1
>
> This query leads to that plan:
>
> *(2) Project [event_id#12131 AS search_event_id#12118, query_string#12178, 
> event_id#12209]
> +- *(2) BroadcastHashJoin [event_id#12131], [source_event_id#12221], Inner, 
> BuildLeft
>:- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
> true]))
>:  +- *(1) Project [event_id#12131, query_string#12178]
>: +- *(1) Filter isnotnull(event_id#12131)
>:+- *(1) FileScan parquet 
> s[event_id#12131,query_string#12178,year#12194,month#12195,day#12196] 
> Batched: true, Format: Parquet, Location: 
> PrunedInMemoryFileIndex[hdfs:///search/year=2020/month=4/day=29/...,
>  PartitionCount: 1, PartitionFilters: [isnotnull(year#12194), 
> isnotnull(month#12195), isnotnull(day#12196), (year#12194 = 2020), (month..., 
> PushedFilters: [IsNotNull(event_id)], ReadSchema: 
> struct
>+- *(2) Project [event_id#12209, source_event_id#12221]
>   +- *(2) Filter isnotnull(source_event_id#12221)
>  +- *(2) FileScan parquet 
> s[event_id#12209,source_event_id#12221,year#12308,month#12309,day#12310] 
> Batched: true, Format: Parquet, Location: 
> PrunedInMemoryFileIndex[hdfs:///p/year=2020/month=4/day=2..., 
> PartitionCount: 1, PartitionFilters: [isnotnull(day#12310), 
> isnotnull(year#12308), isnotnull(month#12309), (year#12308 = 2020), 
> (month..., PushedFilters: [IsNotNull(source_event_id)], ReadSchema: 
> struct
>
> Without partition pruning the join gets planned as LeftOuter with
> SortMergeJoin, but we need partition pruning in this case to prevent full
> table scans and to benefit from the broadcast join...
>
> As soon as we rewrite the query with scala the plan looks fine
>
> val s = spark.sql("select event_id, query_string from ssi_kpi.search where 
> year = 2020 and month = 4 and day = 29")
> val p = spark.sql("select event_id, source_event_id from ssi_kpi.pda_show 
> where year = 2020 and month = 4 and day = 29")
>
> s
>   .join(p, s("event_id") <=> p("source_event_id"), "left_outer")
>   .groupBy(s("query_string"))
>   .agg(count(s("query_string")), count(p("event_id")))
>   .show()
>
>
>
> The second thing we saw is that conditions in the WHERE clause on joined
> tables get pushed down to the Parquet files and lead to wrong results, for
> example:
>
> select
> s.event_id as search_event_id,
> s.query_string,
> p.event_id
> from s
> left outer join p on s.event_id = p.source_event_id
> where
> s.year = 2020 and s.month = 4 and s.day = 29
> and p.year = 2020 and p.month = 4 and p.day = 29
> and p.event_id is null
>
> Until now I assumed that string-based queries and the Scala DSL lead
> to the same execution plan. Can someone point to docs about the internals
> of this part of Spark? The official docs about SQL in general are not that
> verbose.
>
> Thanks in advance and stay safe!
>
> Roland Johann
>


-- 
I appreciate your time,

~Randy

randyclin...@gmail.com