Why does an NPE happen with multithreading in cluster mode but not client mode

2020-12-02 Thread lk_spark
Hi all: I'm using Spark 2.4 and trying to use SparkContext from multiple threads. I found an example: https://hadoopist.wordpress.com/2017/02/03/how-to-use-threads-in-spark-job-to-achieve-parallel-read-and-writes/ with some code like this: for (a <- 0 until 4) { val thread = new Thread {
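The pattern in the linked example (several threads each driving a job against one shared SparkContext) can be sketched in plain Python with a thread pool. `run_job` here is a hypothetical stand-in for a Spark action; in a real application it would call into the shared SparkSession.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a Spark action (e.g. one read/write per table).
# In a real job this would use a shared SparkContext/SparkSession, which is
# thread-safe for submitting concurrent jobs from the driver.
def run_job(table_id):
    return f"processed-{table_id}"

# Submit four jobs concurrently, mirroring the `for (a <- 0 until 4)` loop
# from the linked example. map() preserves input order in its results.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_job, range(4)))

print(results)  # ['processed-0', 'processed-1', 'processed-2', 'processed-3']
```

Note this sketch says nothing about why cluster mode would NPE where client mode does not; it only shows the driver-side threading shape being discussed.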

Re: Structured Streaming Checkpoint Error

2020-12-02 Thread Jungtaek Lim
In theory it would work, but checkpointing would be very inefficient. If I understand correctly, it writes the content to a temp file on S3, then "renames" the file, which actually reads the temp file back from S3 and writes its content to the final path on S3. Compared to checkpoint with
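The commit pattern being described can be sketched as write-temp-then-rename. This is a minimal local-filesystem sketch, not Spark's actual checkpoint code: on a local or HDFS-like filesystem the rename is a cheap metadata operation, whereas S3 emulates rename as copy-then-delete, re-reading and re-writing the whole object — the inefficiency described above.

```python
import os
import tempfile

def commit_checkpoint(content: str, final_path: str) -> None:
    # Write to a temp file in the target directory first, then rename it
    # into place so readers never observe a half-written checkpoint.
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(content)
    # Atomic and cheap on a local filesystem; on S3 an equivalent "rename"
    # would be a full copy of the object followed by a delete.
    os.replace(tmp_path, final_path)

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "offsets")
    commit_checkpoint("batch-42", target)
    with open(target) as f:
        result = f.read()

print(result)  # batch-42
```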

Re: In Windows 10, accessing Hive from PySpark with PyCharm throws error

2020-12-02 Thread Artemis User
Apparently this is an OS dynamic library link error. Make sure you have LD_LIBRARY_PATH (on Linux) or PATH (on Windows) set up properly for the right .so or .dll file... On 12/2/20 5:31 PM, Mich Talebzadeh wrote: Hi, I have a simple code that tries to create Hive derby database as follows:

In Windows 10, accessing Hive from PySpark with PyCharm throws error

2020-12-02 Thread Mich Talebzadeh
Hi, I have a simple piece of code that tries to create a Hive Derby database as follows: from pyspark import SparkContext from pyspark.sql import SQLContext from pyspark.sql import HiveContext from pyspark.sql import SparkSession from pyspark.sql import Row from pyspark.sql.types import StringType,

Re: Spark ML / ALS question

2020-12-02 Thread Sean Owen
There is only a fit() method in spark.ml's ALS: http://spark.apache.org/docs/latest/api/scala/org/apache/spark/ml/recommendation/ALS.html. The older spark.mllib interface has a train() method. You'd generally use the spark.ml version. On Wed, Dec 2, 2020 at 2:13 PM Steve Pruitt wrote: > I am

Spark ML / ALS question

2020-12-02 Thread Steve Pruitt
I am having a little difficulty finding information on the ALS train(…) method in spark.ml. It's unclear when to use it. In the Javadoc, the parameters are undocumented. What is the difference between train(..) and fit(..)? When would you use one or the other? -S

Spark UI Storage Memory

2020-12-02 Thread Amit Sharma
Hi, I have a Spark streaming job. When I check the Executors tab, there is a Storage Memory column. It displays used memory / total memory. What is "used memory"? Is it memory currently in use, or memory used so far? How would I know how much memory is unused at a given point in time? Thanks Amit

Re: Regexp_extract not giving correct output

2020-12-02 Thread Sean Owen
This means there is something wrong with your regex vs what Java supports. Do you mean "(?:" rather than "(?" around where the error is? This is not related to Spark. On Wed, Dec 2, 2020 at 9:45 AM Sachit Murarka wrote: > Hi Sean, > > Thanks for quick response! > > I have tried with string
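The distinction Sean is pointing at can be illustrated with Python's `re` module (the original error came from Java's regex engine via Spark, so this is only an analogy): `(?:...)` opens a non-capturing group, while a bare `(?` followed by an unrecognized construct is a syntax error. The pattern and input below are simplified, hypothetical variants of the ones in the thread.

```python
import re

# "(?:...)" groups without capturing, so group(1) refers to the first
# *capturing* group -- here, the bracketed alphanumeric ID.
m = re.match(r"(?:\[OrderID:\s)?\[([a-z0-9A-Z]*)\]", "[abc123] [rest]")
print(m.group(1))  # abc123
```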

Re: Regexp_extract not giving correct output

2020-12-02 Thread Sachit Murarka
Hi Sean, Thanks for the quick response! I have tried with the 'r' string-literal prefix; that also gave an empty result: spark.sql(r"select regexp_extract('[11] [22] [33]','(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)',1) as anyid").show() and as I

Re: Regexp_extract not giving correct output

2020-12-02 Thread Sean Owen
As in Java/Scala, in Python you'll need to escape the backslashes with \\. "\[" means just "[" in a string. I think you could also prefix the string literal with 'r' to disable Python's handling of escapes. On Wed, Dec 2, 2020 at 9:34 AM Sachit Murarka wrote: > Hi All, > > I am using Pyspark to
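The escaping point can be checked in plain Python without Spark, using the sample string from the thread: a raw string literal and a doubled-backslash literal produce the identical pattern, and both match as intended.

```python
import re

pattern_raw = r"\[(\d+)\]"        # raw string: backslashes reach the regex engine intact
pattern_escaped = "\\[(\\d+)\\]"  # equivalent spelling with doubled backslashes

# Both literals denote the same pattern string.
print(pattern_raw == pattern_escaped)  # True

s = "[11] [22] [33]"
print(re.findall(pattern_raw, s))      # ['11', '22', '33']
print(re.findall(pattern_escaped, s))  # ['11', '22', '33']
```

Note the original pattern may still fail for a separate reason: the conditional group `(?(1)...)` is not supported by Java's regex engine, which is what Spark's regexp_extract uses.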

Regexp_extract not giving correct output

2020-12-02 Thread Sachit Murarka
Hi All, I am using PySpark to extract a value from a column on the basis of a regex. Following is the regex which I am using: (^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*) df = spark.createDataFrame([("[1234] [] [] [66]",), ("abcd",)],["stringValue"])

Re: Structured Streaming Checkpoint Error

2020-12-02 Thread German Schiavon
Hello! @Gabor Somogyi I wonder, now that S3 is *strongly consistent*, whether it would work fine. Regards! https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/ On Thu, 17 Sep 2020 at 11:55, German Schiavon wrote: > Hi Gabor, > > Makes sense, thanks a lot! > > On

Re: Remove subsets from FP Growth output

2020-12-02 Thread Sean Owen
-dev Increase the threshold? Just filter the rules as desired after they are generated? It's not clear what your criteria are. On Wed, Dec 2, 2020 at 7:30 AM Aditya Addepalli wrote: > Hi, > > Is there a good way to remove all the subsets of patterns from the output > given by FP Growth? > >
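One way to do the post-filtering Sean suggests, sketched in plain Python (a hypothetical post-processing step, not an FPGrowth API): keep only the maximal patterns, i.e. drop every pattern that is a proper subset of another pattern in the output.

```python
def keep_maximal(patterns):
    """Drop any pattern that is a proper subset of another pattern."""
    sets = [frozenset(p) for p in patterns]
    return [
        sorted(p) for p in sets
        if not any(p < other for other in sets)  # '<' is proper-subset on sets
    ]

# Toy frequent-itemset output: ['a'] and ['b'] are subsumed by ['a', 'b'].
freq_patterns = [["a"], ["b"], ["a", "b"], ["c"]]
print(keep_maximal(freq_patterns))  # [['a', 'b'], ['c']]
```

This is O(n^2) in the number of patterns, so for large outputs it may be cheaper to first raise the support/confidence thresholds, as suggested above, and subset-filter only the remainder.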