Re: Attaching Remote Debugger to Executor Threads

2020-10-15 Thread Jeff Evans
Use spark.executor.extraJavaOptions

https://spark.apache.org/docs/latest/configuration.html#runtime-environment
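
For example, a minimal sketch (reusing the same agent string as your
driver-side flags; note that a fixed port like 5005 will conflict if more
than one executor lands on the same host, so you may want to limit to one
executor per host while debugging, and use suspend=y if the executor
should wait for the debugger to attach):

    from pyspark.sql import SparkSession

    # Sketch only: applies the JDWP agent to executor JVMs rather than
    # the driver.
    spark = (
        SparkSession.builder
        .config(
            "spark.executor.extraJavaOptions",
            "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005",
        )
        .getOrCreate()
    )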

On Thu, Oct 15, 2020 at 1:22 PM Akshat Bordia 
wrote:

> Hi,
>
> We are trying to debug an issue with Spark and need to connect a remote
> debugger to the executor threads. The general option with spark-submit
> (-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005)
> seems to allow debugging only the driver thread. Does anyone have any insight
> on how we can attach a remote debugger to executor threads? Appreciate any
> pointers.
>
> Thanks,
> Akshat
>
> --
> Regards,
> Akshat Bordia
> Email id : akshat.bordi...@gmail.com
>


Supporting Row and DataFrame level metadata?

2020-09-24 Thread Jeff Evans
Hi,

I'm wondering if there has been any past discussion on the subject of
supporting metadata attributes as a first-class concept, both at the row
level and at the DataFrame level. I did a Jira search, but most of the
items I found were unrelated to this concept, or pertained to column-level
metadata, which is of course already supported.

Row-level metadata would be useful in scenarios like the following:

   - Lineage and provenance attributes, which need to eventually be
   propagated to some other system, but which shouldn't be written out with
   the "regular" DataFrame.
   - Other custom attributes, such as input_file_name for data read from
   HDFS, or message keys from Kafka

So why not just store the attributes as regular columns (possibly with
some special prefix to help us filter them out if needed)?

   - When passing the DataFrame to another piece of library code, we might
   need to remove those columns, depending on what it does (ex: if it operates
   on every column).  Or we might need to perform an extra join in order to
   "retain" the attributes from the rows processed by the library function.
   - If we need to union an existing DataFrame (with metadata) with another
   one read from a different source (which has different metadata, or none),
   and metadata attributes are represented as normal columns, we have to do
   some finagling to get the union to work properly.
   - If we want to simply write the DataFrame somewhere, we probably don't
   want to mix metadata attributes with the actual data (see the sketch
   after this list).
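
A minimal PySpark sketch of that filtering chore (the meta_ prefix and the
output path are hypothetical conventions, not anything Spark provides):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    # Hypothetical convention: metadata columns carry a meta_ prefix.
    df = df.withColumn("meta_source_file", F.input_file_name())

    # Every write (and any column-sensitive library call) then has to
    # filter the metadata columns out by hand.
    data_cols = [c for c in df.columns if not c.startswith("meta_")]
    df.select(data_cols).write.mode("overwrite").parquet("/tmp/out")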

For DataFrame-level metadata:

   - Attributes such as the table/schema/DB name, or primary key
   information, for DataFrames read from JDBC (ex: downstream processing might
   want to always partitionBy these key columns, whatever they happen to be)
   - Adding tracking information about what app-specific processing steps
   have been applied so far, their timings, etc.
   - For SQL sources, capturing the full query that produced the DataFrame

Some of these scenarios can be made easier by custom code with implicit
conversions, as outlined here. But that approach has its own drawbacks and
shortcomings (as outlined in the comments); a minimal sketch of the idea
follows.
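
(All names below are hypothetical, and this is a sketch rather than a
recommendation; note how the metadata fails to propagate through
transformations, which is one of the shortcomings in question.)

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical helper: carry DataFrame-level metadata in a side dict
    # keyed by DataFrame identity. Since transformations return new
    # DataFrames, metadata must be re-attached at every step.
    _df_metadata = {}

    def with_metadata(df, **attrs):
        _df_metadata[id(df)] = {**_df_metadata.get(id(df), {}), **attrs}
        return df

    def metadata_of(df):
        return _df_metadata.get(id(df), {})

    df = with_metadata(spark.range(10),
                       source_table="mydb.users", primary_key=["id"])
    print(metadata_of(df))           # both attributes present
    print(metadata_of(df.limit(5)))  # {} -- lost after a transformation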

How are people currently managing this?  Does it make sense, conceptually,
as something that Spark should directly support?


Re: Unsubscribe

2020-06-23 Thread Jeff Evans
That is not how you unsubscribe.  See here:
https://gist.github.com/jeff303/ba1906bb7bcb2f2501528a8bb1521b8e

On Tue, Jun 23, 2020 at 5:02 AM Kiran Kumar Dusi 
wrote:

> Unsubscribe
>
> On Tue, 23 Jun 2020 at 15:18 Akhil Anil  wrote:
>
>> --
>> Sent from Gmail Mobile
>>
>


Re: ./dev/run-tests failing at master

2020-05-14 Thread Jeff Evans
Are you positive you set up your Python environment correctly?  To me,
those error messages look like you are running Python 2, but it should be
Python 3.
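
(For reference: every flagged construct, i.e. keyword arguments to print
and function annotations like s1: pd.Series, only parses under Python 3,
which is why flake8 reports E999 syntax errors rather than ordinary style
violations. A quick check of the interpreter in use:)

    import sys

    # Under Python 2 the next line is itself a parse-time SyntaxError,
    # since print is a statement there; under Python 3 it runs fine.
    print("running under:", sys.version, file=sys.stderr)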

On Thu, May 14, 2020 at 1:34 PM Sudharshann D  wrote:

> Hello! ;)
>
> I'm new to spark development and have been trying to set up my dev
> environment for hours without much success. :-(
>
> Firstly, I'm wondering why my ./dev/run-tests fails even though I'm on the
> master branch.
>
> This is the error:
>
> flake8 checks failed:
> ./examples/src/main/python/sql/arrow.py:67:16: E999 SyntaxError: invalid
> syntax
> def func(s1: pd.Series, s2: pd.Series, s3: pd.DataFrame) ->
> pd.DataFrame:
>^
> ./python/run-tests.py:123:48: E999 SyntaxError: invalid syntax
> print(decoded_line, end='')
>^
> ./dev/run-tests-jenkins.py:43:20: E999 SyntaxError: invalid syntax
> print(msg, file=sys.stderr)
>^
> ./dev/run-tests.py:273:44: E999 SyntaxError: invalid syntax
> print(line.decode('utf-8'), end='')
>^
> ./dev/pip-sanity-check.py:31:76: E999 SyntaxError: invalid syntax
> print("Value {0} did not match expected value.".format(value),
> file=sys.stderr)
>
>  ^
> 5 E999 SyntaxError: invalid syntax
> 5
> 1
> [error] running /home/vagrant/gtSync/suddhuApache/spark/dev/lint-python ;
> received return code 1
>
>
>
> Would love to have some help with setting up my dev environment. I've done
> some other open source contributions and setting up the dev env there was
> easier...
>
> Regards
> Sudha
>


Re: What options do I have to handle third party classes that are not serializable?

2020-02-25 Thread Jeff Evans
Did you try this?  https://stackoverflow.com/a/2114387/375670
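
Failing that, a common workaround is to construct the non-serializable
object inside each task, instead of capturing it in a closure, so it never
needs to be serialized at all. A minimal PySpark sketch of the idea (the
helper class is a stand-in for something like FacetsConfig; a Scala
version using mapPartitions would be analogous):

    from pyspark.sql import SparkSession

    class NonSerializableHelper:
        """Stand-in for a third-party class that cannot be serialized."""
        def __getstate__(self):
            raise TypeError("refusing to serialize")

    def process_partition(values):
        # Built once per task, on the executor, so the instance itself
        # is never shipped from the driver.
        helper = NonSerializableHelper()
        for v in values:
            yield (v, type(helper).__name__)

    spark = SparkSession.builder.getOrCreate()
    rdd = spark.sparkContext.parallelize(range(10), 2)
    print(rdd.mapPartitions(process_partition).collect())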


On Tue, Feb 25, 2020 at 10:23 AM yeikel valdes  wrote:

> I am currently using a third-party library (Lucene) with Spark that is not
> serializable. Due to that reason, it generates the following exception:
>
> Job aborted due to stage failure: Task 144.0 in stage 25.0 (TID 2122) had a 
> not serializable result: org.apache.lucene.facet.FacetsConfig Serialization 
> stack: - object not serializable (class: 
> org.apache.lucene.facet.FacetsConfig, value: 
> org.apache.lucene.facet.FacetsConfg
>
> While it would be ideal if this class was serializable, there is really 
> nothing I can do to change this third party library in order to add 
> serialization to it.
>
> What options do I have, and what's the recommended option to handle this 
> problem?
>
> Thank you!
>
>
>


PySpark setup for IntelliJ IDEA

2020-01-24 Thread Jeff Evans
I couldn't find any specific information on setting up IntelliJ to debug
PySpark correctly, so I did a short writeup here, after fumbling my way
through it: https://github.com/jeff303/spark-development-tips

Any improvements, corrections, or suggestions are welcomed.


Re: Build error: python/lib/pyspark.zip is not a ZIP archive

2020-01-10 Thread Jeff Evans
Actually, there is a really trivial fix for that (an existing file not
being deleted when packaging).  Opened SPARK-30489 for it.

On Fri, Jan 10, 2020 at 3:52 PM Jeff Evans 
wrote:

> Thanks for the tip.  Fixed by simply removing python/lib/pyspark.zip
> (since it's apparently generated), and rebuilding.  I guess clean does
> not remove it.
>
> On Fri, Jan 10, 2020 at 3:50 PM Sean Owen  wrote:
>
>> Sounds like you might have some corrupted file locally. I don't see
>> any of the automated test builders failing. Nuke your local assembly
>> build and try again?
>>
>> On Fri, Jan 10, 2020 at 3:49 PM Jeff Evans
>>  wrote:
>> >
>> > Greetings,
>> >
>> > I'm getting an error when building, on latest master (2bd873181 as of
>> this writing).  Full build command I'm running is: ./build/mvn -DskipTests
>> clean package
>> >
>> > [ERROR] Failed to execute goal
>> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (create-tmp-dir) on
>> project spark-assembly_2.12: An Ant BuildException has occured: Problem
>> reading /Users/jeff/dev/spark/python/lib/pyspark.zip
>> > [ERROR] around Ant part ...> destfile="/Users/jeff/dev/spark/assembly/../python/lib/pyspark.zip">... @
>> 6:76 in /Users/jeff/dev/spark/assembly/target/antrun/build-main.xml:
>> archive is not a ZIP archive
>> > [ERROR] -> [Help 1]
>> >
>> > Trying to run unzip -l python/lib/pyspark.zip does seem to suggest it's
>> not a valid zip file.  Any ideas what might be wrong?  I tried searching
>> the archives and didn't see anything relevant.  Thanks.
>> >
>> > OS X Catalina 10.15.2
>> > OpenJDK 1.8.0_212
>> > Maven 3.6.3
>> > Python 3.8.1 (via pyenv)
>>
>


Re: Build error: python/lib/pyspark.zip is not a ZIP archive

2020-01-10 Thread Jeff Evans
Thanks for the tip.  Fixed by simply removing python/lib/pyspark.zip (since
it's apparently generated), and rebuilding.  I guess clean does not remove
it.

On Fri, Jan 10, 2020 at 3:50 PM Sean Owen  wrote:

> Sounds like you might have some corrupted file locally. I don't see
> any of the automated test builders failing. Nuke your local assembly
> build and try again?
>
> On Fri, Jan 10, 2020 at 3:49 PM Jeff Evans
>  wrote:
> >
> > Greetings,
> >
> > I'm getting an error when building, on latest master (2bd873181 as of
> this writing).  Full build command I'm running is: ./build/mvn -DskipTests
> clean package
> >
> > [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (create-tmp-dir) on
> project spark-assembly_2.12: An Ant BuildException has occured: Problem
> reading /Users/jeff/dev/spark/python/lib/pyspark.zip
> > [ERROR] around Ant part ... destfile="/Users/jeff/dev/spark/assembly/../python/lib/pyspark.zip">... @
> 6:76 in /Users/jeff/dev/spark/assembly/target/antrun/build-main.xml:
> archive is not a ZIP archive
> > [ERROR] -> [Help 1]
> >
> > Trying to run unzip -l python/lib/pyspark.zip does seem to suggest it's
> not a valid zip file.  Any ideas what might be wrong?  I tried searching
> the archives and didn't see anything relevant.  Thanks.
> >
> > OS X Catalina 10.15.2
> > OpenJDK 1.8.0_212
> > Maven 3.6.3
> > Python 3.8.1 (via pyenv)
>


Build error: python/lib/pyspark.zip is not a ZIP archive

2020-01-10 Thread Jeff Evans
Greetings,

I'm getting an error when building, on latest master (2bd873181 as of this
writing).  Full build command I'm running is: ./build/mvn -DskipTests clean
package

[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-antrun-plugin:1.8:run (create-tmp-dir) on
project spark-assembly_2.12: An Ant BuildException has occured: Problem
reading /Users/jeff/dev/spark/python/lib/pyspark.zip
[ERROR] around Ant part .. @
6:76 in /Users/jeff/dev/spark/assembly/target/antrun/build-main.xml:
archive is not a ZIP archive
[ERROR] -> [Help 1]

Trying to run unzip -l python/lib/pyspark.zip does seem to suggest it's not
a valid zip file.  Any ideas what might be wrong?  I tried searching the
archives and didn't see anything relevant.  Thanks.
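
For what it's worth, the same validity check can be done from the Python
standard library, independent of the unzip binary:

    import zipfile

    # False for a truncated or otherwise corrupted archive.
    print(zipfile.is_zipfile("python/lib/pyspark.zip"))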

   - OS X Catalina 10.15.2
   - OpenJDK 1.8.0_212
   - Maven 3.6.3
   - Python 3.8.1 (via pyenv)