Re: "java.lang.AssertionError: assertion failed: Failed to get records for **** after polling for 180000" error

2019-03-05 Thread Akshay Bhardwaj
Hi,

To better debug the issue, please check the below config properties:

   - max.partition.fetch.bytes within the Spark Kafka consumer. If not set
   for the consumer, the global config at the broker level applies.
   - spark.streaming.kafka.consumer.poll.ms
   - spark.network.timeout (if the above is not set, poll.ms defaults to
   spark.network.timeout)
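
For illustration, a minimal sketch of where these could be set (the values and
the broker address below are made up, not recommendations):

// Illustrative values only -- tune for your workload.
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("kafka-polling-debug")
  .config("spark.streaming.kafka.consumer.poll.ms", "120000") // per-poll timeout for the Kafka consumer
  .config("spark.network.timeout", "300s")                    // fallback used when poll.ms is not set
  .getOrCreate()

// Consumer-level Kafka property, passed through the Kafka params of the stream:
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",                       // hypothetical broker address
  "max.partition.fetch.bytes" -> "10485760"                   // 10 MB per partition per fetch
)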

Akshay Bhardwaj
+91-97111-33849


On Wed, Mar 6, 2019 at 8:39 AM JF Chen  wrote:

> When my Kafka executor reads data from Kafka, sometimes it throws the
> error "java.lang.AssertionError: assertion failed: Failed to get records
> for  after polling for 180000", which happens after 3 minutes of executing.
> The data waiting to be read is not huge, about 1 GB. Other partitions read
> by other tasks are very fast; the error always occurs on some specific
> executor.
>
> Regard,
> Junfeng Chen
>


Re: "java.lang.AssertionError: assertion failed: Failed to get records for **** after polling for 180000" error

2019-03-05 Thread Shyam P
It would be better if you shared a code block so we can understand it better.

Otherwise it would be difficult to provide an answer.

~Shyam

On Wed, Mar 6, 2019 at 8:38 AM JF Chen  wrote:

> When my Kafka executor reads data from Kafka, sometimes it throws the
> error "java.lang.AssertionError: assertion failed: Failed to get records
> for  after polling for 180000", which happens after 3 minutes of executing.
> The data waiting to be read is not huge, about 1 GB. Other partitions read
> by other tasks are very fast; the error always occurs on some specific
> executor.
>
> Regard,
> Junfeng Chen
>


Re: spark df.write.partitionBy run very slow

2019-03-05 Thread Shyam P
Hi JF,
Yes, first we should know the actual number of partitions the dataframe has
and the count of records in each. Accordingly, we should try to spread the
data evenly across all partitions.
It is always better to have Num of partitions = N * Num of executors.


  "But the sequence of columns in partitionBy decides the directory
hierarchy structure. I hope the sequence of columns does not change",
this is correct.
Hence sometimes we should go with the bigger number first, then the lesser
ones, i.e. more parent directories and fewer child directories. Tweak it
and try.

"some tasks in write hdfs stage cost much more time than others" may mean the
data is skewed; it needs to be distributed evenly across all partitions.
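
A minimal sketch of that idea (column names and output_path are the ones from
this thread; this is illustrative, not a tested fix):

import org.apache.spark.sql.functions.col

// Repartition on the same columns used in partitionBy, so each task writes to
// one (or a few) output directories instead of many; if a single key dominates,
// adding a salt column to the repartition can spread it further.
df.repartition(col("column_a"), col("column_b"), col("column_c"))
  .write
  .partitionBy("column_a", "column_b", "column_c")
  .parquet(output_path)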

~Shyam

On Wed, Mar 6, 2019 at 8:33 AM JF Chen  wrote:

> Hi Shyam
> Thanks for your reply.
> You mean after knowing the partition counts of column_a, column_b, and
> column_c, the sequence of columns in partitionBy should match the order of
> the partition counts of columns a, b, and c?
> But the sequence of columns in partitionBy decides the directory hierarchy
> structure. I hope the sequence of columns does not change.
>
> And I found one more strange thing: some tasks in the write-to-HDFS stage
> cost much more time than others, even though the amount of data written is
> similar. How to solve it?
>
> Regard,
> Junfeng Chen
>
>
> On Tue, Mar 5, 2019 at 3:05 PM Shyam P  wrote:
>
>> Hi JF,
>> Try to execute this before df.write:
>>
>> //count by partition_id
>> import org.apache.spark.sql.functions.spark_partition_id
>> df.groupBy(spark_partition_id).count.show()
>>
>> You will come to know how data has been partitioned inside df.
>>
>> A small trick we can apply here with partitionBy(column_a, column_b,
>> column_c):
>> make sure you have (column_a partitions) > (column_b partitions) >
>> (column_c partitions).
>>
>> Try this.
>>
>> Regards,
>> Shyam
>>
>> On Mon, Mar 4, 2019 at 4:09 PM JF Chen  wrote:
>>
>>> I am trying to write data in a dataset to HDFS via
>>> df.write.partitionBy(column_a, column_b, column_c).parquet(output_path).
>>> However, it costs several minutes to write only hundreds of MB of data to
>>> HDFS.
>>> From this article
>>> ,
>>> adding a repartition before the write should work. But if there is
>>> data skew, some tasks may take much longer than average, which still
>>> costs much time.
>>> How to solve this problem? Thanks in advance!
>>>
>>>
>>> Regard,
>>> Junfeng Chen
>>>
>>


Re: How to group dataframe year-wise and iterate through groups and send each year to dataframe to executor?

2019-03-05 Thread Shyam P
Thanks a lot Roman.

But the provided link has several ways to deal with the problem.
Why do we need to do the operation on an RDD instead of a dataframe/dataset?

Do I need a custom partitioner in my case, and how do I invoke it in spark-sql?

Can anyone provide a sample of handling skewed data with spark-sql?

Thanks,
Shyam

>


"java.lang.AssertionError: assertion failed: Failed to get records for **** after polling for 180000" error

2019-03-05 Thread JF Chen
When my Kafka executor reads data from Kafka, sometimes it throws the error
"java.lang.AssertionError: assertion failed: Failed to get records for 
after polling for 180000", which happens after 3 minutes of executing.
The data waiting to be read is not huge, about 1 GB. Other partitions read by
other tasks are very fast; the error always occurs on some specific executor.

Regard,
Junfeng Chen


Re: spark df.write.partitionBy run very slow

2019-03-05 Thread JF Chen
Hi Shyam
Thanks for your reply.
You mean after knowing the partition counts of column_a, column_b, and
column_c, the sequence of columns in partitionBy should match the order of
the partition counts of columns a, b, and c?
But the sequence of columns in partitionBy decides the directory hierarchy
structure. I hope the sequence of columns does not change.

And I found one more strange thing: some tasks in the write-to-HDFS stage
cost much more time than others, even though the amount of data written is
similar. How to solve it?

Regard,
Junfeng Chen


On Tue, Mar 5, 2019 at 3:05 PM Shyam P  wrote:

> Hi JF,
> Try to execute this before df.write:
>
> //count by partition_id
> import org.apache.spark.sql.functions.spark_partition_id
> df.groupBy(spark_partition_id).count.show()
>
> You will come to know how data has been partitioned inside df.
>
> A small trick we can apply here with partitionBy(column_a, column_b,
> column_c):
> make sure you have (column_a partitions) > (column_b partitions) >
> (column_c partitions).
>
> Try this.
>
> Regards,
> Shyam
>
> On Mon, Mar 4, 2019 at 4:09 PM JF Chen  wrote:
>
>> I am trying to write data in a dataset to HDFS via
>> df.write.partitionBy(column_a, column_b, column_c).parquet(output_path).
>> However, it costs several minutes to write only hundreds of MB of data to
>> HDFS.
>> From this article
>> ,
>> adding a repartition before the write should work. But if there is data
>> skew, some tasks may take much longer than average, which still costs
>> much time.
>> How to solve this problem? Thanks in advance!
>>
>>
>> Regard,
>> Junfeng Chen
>>
>


Re: [SQL] 64-bit hash function, and seeding

2019-03-05 Thread Huon.Wilson
Hi Nicolas,

On 6/3/19, 7:48 am, "Nicolas Paris"  wrote:

Hi Huon

Good catch. A 64 bit hash is definitely a useful function.

> the birthday paradox implies  >50% chance of at least one for tables 
larger than 77000 rows

Do you know how many rows it takes to have a 50% chance for a 64-bit hash?

5 billion: it's roughly equal to the square root of the total number of 
possible hash values. You can see a detailed table at 
https://en.wikipedia.org/wiki/Birthday_problem#Probability_table .

Note, for my application a few collisions are fine. There are a few ways of trying 
to quantify this, and one is the maximum number of items that all hash to a 
single particular hash value: if one has 4 billion rows with a 32-bit hash, the 
size of this largest set is likely to be 14 (and there are going to be many 
other smaller sets of colliding values). With a 64-bit hash, it is likely to be 
2, and the table size can be as large as ~8 trillion before the expected 
maximum exceeds 3. 
(https://en.wikipedia.org/wiki/Balls_into_bins_problem#Random_allocation)

Another way is the expected number of collisions; for the three cases above it 
is 1.6 billion (32-bit hash, 4 billion rows), 0.5 (64-bit, 4 billion), and 2.1 
million (64-bit, 8 trillion). 
(http://matt.might.net/articles/counting-hash-collisions/)
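
As a quick back-of-the-envelope check of those numbers (my own sketch, not part
of the proposed patch; it just evaluates the balls-into-bins expectation
n - d*(1 - (1 - 1/d)^n) for n rows and d = 2^bits possible hash values):

def expectedCollisions(rows: Double, bits: Int): Double = {
  val d = math.pow(2.0, bits) // number of possible hash values
  // log1p/expm1 keep the computation stable when 1/d is tiny
  rows + d * math.expm1(rows * math.log1p(-1.0 / d))
}

println(expectedCollisions(math.pow(2, 32), 32)) // ~1.6e9  (32-bit, ~4 billion rows)
println(expectedCollisions(math.pow(2, 32), 64)) // ~0.5    (64-bit, ~4 billion rows)
println(expectedCollisions(math.pow(2, 43), 64)) // ~2.1e6  (64-bit, ~8.8 trillion rows)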

About the seed column, to me there is no need for such an argument: you
can just add an integer as a regular column.
   
You are correct that this works, but it increases the amount of computation 
(doubles it, when just trying to hash a single column). For multiple columns, 
col1, col2, ... colN, the `hash` function works approximately like (in 
pseudo-scala, and simplified from Spark's actual implementation):

val InitialSeed = 42L
def hash(col1, col2, ..., colN) = {
  var value = InitialSeed
  value = hashColumn(col1, seed = value)
  value = hashColumn(col2, seed = value)
  ...
  value = hashColumn(colN, seed = value)
  return value
}

If that starting value can be customized, then a call like `hash(lit(mySeed), 
column)` (which has to do the work to hash two columns) can be changed to 
instead just start at `mySeed`, and only hash one column. That said, for the 
hashes spark uses (xxHash and MurmurHash3), the hashing operation isn't too 
expensive, especially for ints/longs.
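
Purely as illustration of that point (mySeed and col1 are made-up names, and
the seeded overload in the comment is the proposal above, not an existing API):

import org.apache.spark.sql.functions.{hash, lit}

val mySeed = 12345
val df = spark.range(10).toDF("col1") // any DataFrame; this one is just for illustration
// Today: the seed rides along as an extra literal column, so two values are
// hashed per row.
val withHash = df.withColumn("h", hash(lit(mySeed), df("col1")))
// With a seeded overload, only col1 would be hashed, starting from mySeed:
//   df.withColumn("h", hash(mySeed, df("col1")))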

Huon
 
About the process for pull requests, I cannot help much


On Tue, Mar 05, 2019 at 04:30:31AM +, huon.wil...@data61.csiro.au wrote:
> Hi,
> 
> I’m working on something that requires deterministic randomness, i.e. a 
row gets the same “random” value no matter the order of the DataFrame. A seeded 
hash seems to be the perfect way to do this, but the existing hashes have 
various limitations:
> 
> - hash: 32-bit output (only 4 billion possibilities will result in a lot 
of collisions for many tables: the birthday paradox implies  >50% chance of at 
least one for tables larger than 77000 rows)
> - sha1/sha2/md5: single binary column input, string output
> 
> It seems there’s already support for a 64-bit hash function that can work 
with an arbitrary number of arbitrary-typed columns: XxHash64, and exposing 
this for DataFrames seems like it’s essentially one line in sql/functions.scala 
to match `hash` (plus docs, tests, function registry etc.):
> 
> def hash64(cols: Column*): Column = withExpr { new 
XxHash64(cols.map(_.expr)) }
> 
> For my use case, this can then be used to get a 64-bit “random” column 
like 
> 
> val seed = rng.nextLong()
> hash64(lit(seed), col1, col2)
> 
> I’ve created a (hopefully) complete patch by mimicking ‘hash’ at 
https://github.com/apache/spark/compare/master...huonw:hash64; should I open a 
JIRA and submit it as a pull request?
> 
> Additionally, both hash and the new hash64 already have support for being 
seeded, but this isn’t exposed directly and instead requires something like the 
`lit` above. Would it make sense to add overloads like the following?
> 
> def hash(seed: Int, cols: Columns*) = …
> def hash64(seed: Long, cols: Columns*) = …
> 
> Though, it does seem a bit unfortunate to be forced to pass the seed 
first.
> 
> - Huon
> 
>  
> 


> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org


-- 
nicolas

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org





Re: [SQL] 64-bit hash function, and seeding

2019-03-05 Thread Nicolas Paris
Hi Huon

Good catch. A 64 bit hash is definitely a useful function.

> the birthday paradox implies  >50% chance of at least one for tables larger 
> than 77000 rows

Do you know how many rows it takes to have a 50% chance for a 64-bit hash?


About the seed column, to me there is no need for such an argument: you
can just add an integer as a regular column.

About the process for pull requests, I cannot help much


On Tue, Mar 05, 2019 at 04:30:31AM +, huon.wil...@data61.csiro.au wrote:
> Hi,
> 
> I’m working on something that requires deterministic randomness, i.e. a row 
> gets the same “random” value no matter the order of the DataFrame. A seeded 
> hash seems to be the perfect way to do this, but the existing hashes have 
> various limitations:
> 
> - hash: 32-bit output (only 4 billion possibilities will result in a lot of 
> collisions for many tables: the birthday paradox implies  >50% chance of at 
> least one for tables larger than 77000 rows)
> - sha1/sha2/md5: single binary column input, string output
> 
> It seems there’s already support for a 64-bit hash function that can work 
> with an arbitrary number of arbitrary-typed columns: XxHash64, and exposing 
> this for DataFrames seems like it’s essentially one line in 
> sql/functions.scala to match `hash` (plus docs, tests, function registry 
> etc.):
> 
> def hash64(cols: Column*): Column = withExpr { new 
> XxHash64(cols.map(_.expr)) }
> 
> For my use case, this can then be used to get a 64-bit “random” column like 
> 
> val seed = rng.nextLong()
> hash64(lit(seed), col1, col2)
> 
> I’ve created a (hopefully) complete patch by mimicking ‘hash’ at 
> https://github.com/apache/spark/compare/master...huonw:hash64; should I open 
> a JIRA and submit it as a pull request?
> 
> Additionally, both hash and the new hash64 already have support for being 
> seeded, but this isn’t exposed directly and instead requires something like 
> the `lit` above. Would it make sense to add overloads like the following?
> 
> def hash(seed: Int, cols: Columns*) = …
> def hash64(seed: Long, cols: Columns*) = …
> 
> Though, it does seem a bit unfortunate to be forced to pass the seed first.
> 
> - Huon
> 
>  
> 


> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org


-- 
nicolas

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is there a way to validate the syntax of raw spark sql query?

2019-03-05 Thread kant kodali
Hi Akshay,

Thanks for this. I will give it a try. The Java API for .explain returns
void and doesn't throw any checked exception, so I guess I have to catch
the generic RuntimeException and walk through the stack trace to see if
there is a ParseException. In short, the code just gets really ugly. I
wish the Spark developers would take all of this into account when designing
an API so that it works well for all languages.

Thanks!

On Tue, Mar 5, 2019 at 4:15 AM Akshay Bhardwaj <
akshay.bhardwaj1...@gmail.com> wrote:

> Hi Kant,
>
> You can try "explaining" the sql query.
>
> spark.sql(sqlText).explain(true); //the parameter true is to get a more
> verbose query plan and it is optional.
>
>
> This is the safest way to validate SQL without actually executing/creating
> a df/view in Spark. It validates the syntax as well as the schema of the
> tables/views used.
> If there is an issue with your SQL syntax then the method throws the below
> exception, which you can catch:
>
> org.apache.spark.sql.catalyst.parser.ParseException
>
>
> Hope this helps!
>
>
>
> Akshay Bhardwaj
> +91-97111-33849
>
>
> On Fri, Mar 1, 2019 at 10:23 PM kant kodali  wrote:
>
>> Hi All,
>>
>> Is there a way to validate the syntax of raw spark SQL query?
>>
>> For example, I would like to know if there is any isValid API call that
>> Spark provides.
>>
>> val query = "select * from table"if(isValid(query)) {
>> sparkSession.sql(query) } else {
>> log.error("Invalid Syntax")}
>>
>> I tried the following
>>
>> val query = "select * morf table" // Invalid queryval parser = 
>> spark.sessionState.sqlParsertry{
>> parser.parseExpression(query)} catch (ParseException ex) {
>> throw new Exception(ex); //Exception not getting thrown}Dataset<>Row df 
>> = sparkSession.sql(query) // Exception gets thrown here
>> df.writeStream.format("console").start()
>>
>> Question: parser.parseExpression is not catching the invalid syntax
>> before I hit sparkSession.sql. In other words, it is not being helpful
>> in the above code. Any reason? My whole goal is to catch syntax errors
>> before I pass the query on to sparkSession.sql.
>>
>>
>>


Re: [Kubernetes] [SPARK-27061] Need to expose 4040 port on driver service

2019-03-05 Thread Chandu Kavar
Thank you for the clarification.

On Tue, Mar 5, 2019 at 11:59 PM Gabor Somogyi 
wrote:

> Hi,
>
> It will be automatically assigned when one creates a PR.
>
> BR,
> G
>
>
> On Tue, Mar 5, 2019 at 4:51 PM Chandu Kavar  wrote:
>
>> My Jira username is: *cckavar*
>>
>> On Tue, Mar 5, 2019 at 11:46 PM Chandu Kavar  wrote:
>>
>>> Hi Team,
>>> I have created a JIRA ticket to expose port 4040 on the driver service.
>>> Also, I added the details in the ticket.
>>>
>>> I am aware of the changes that we need to make and want to assign this
>>> task to myself.
>>>
>>> I am not able to assign it by myself. Can someone help me assign this
>>> task to me?
>>>
>>> Thank you,
>>> Chandu
>>>
>>


Re: [Kubernetes] [SPARK-27061] Need to expose 4040 port on driver service

2019-03-05 Thread Gabor Somogyi
Hi,

It will be automatically assigned when one creates a PR.

BR,
G


On Tue, Mar 5, 2019 at 4:51 PM Chandu Kavar  wrote:

> My Jira username is: *cckavar*
>
> On Tue, Mar 5, 2019 at 11:46 PM Chandu Kavar  wrote:
>
>> Hi Team,
>> I have created a JIRA ticket to expose port 4040 on the driver service.
>> Also, I added the details in the ticket.
>>
>> I am aware of the changes that we need to make and want to assign this
>> task to myself.
>>
>> I am not able to assign it by myself. Can someone help me assign this
>> task to me?
>>
>> Thank you,
>> Chandu
>>
>


Re: [Kubernetes] [SPARK-27061] Need to expose 4040 port on driver service

2019-03-05 Thread Chandu Kavar
My Jira username is: *cckavar*

On Tue, Mar 5, 2019 at 11:46 PM Chandu Kavar  wrote:

> Hi Team,
> I have created a JIRA ticket to expose port 4040 on the driver service.
> Also, I added the details in the ticket.
>
> I am aware of the changes that we need to make and want to assign this
> task to myself.
>
> I am not able to assign it by myself. Can someone help me assign this task
> to me?
>
> Thank you,
> Chandu
>


[Kubernetes] [SPARK-27061] Need to expose 4040 port on driver service

2019-03-05 Thread Chandu Kavar
Hi Team,
I have created a JIRA ticket to expose port 4040 on the driver service. Also,
I added the details in the ticket.

I am aware of the changes that we need to make and want to assign this task
to myself.

I am not able to assign it by myself. Can someone help me assign this task
to me?

Thank you,
Chandu


[PySpark] TypeError: expected string or bytes-like object

2019-03-05 Thread Thomas Ryck
I am using PySpark through JupyterLab with the Spark distribution provided
by *conda install pyspark*. So I run Spark locally.

I started using pyspark 2.4.0 but I had a socket issue, which I solved by
downgrading the package to 2.3.2.

So I am using pyspark 2.3.2 at the moment.

I am trying to do an orderBy on one of my dataframes.

This way: dfRequestwithTime = dfRequestwithTime.orderBy('time', ascending=False)

but I get the following error. (I am sure this line is the issue, since
without it I do not get any error.)

This dataframe contains 2 columns:
- a column "request" containing strings
- a column "time" which is in fact a duration containing integers

Here the first 2 rows.

[Row(request='SELECT XXX FROM XXX WHERE XXX ', time=4),
 Row(request='SELECT XXX FROM XXX WHERE XXX ', time=1)]

I get my durations thanks to a RegEx applied earlier in the notebook.
I cast my "time" column to Integer and dropped NaNs from my dataset.

I get the same error below if I try to do an orderBy on my 'request'
column.



---
Py4JJavaError Traceback (most recent call last)
 in 
> 1 dfRequestwithTime.head(5)

~\Anaconda3\lib\site-packages\pyspark\sql\dataframe.py in head(self, n)
   1132 rs = self.head(1)
   1133 return rs[0] if rs else None
-> 1134 return self.take(n)
   1135 
   1136 @ignore_unicode_prefix

~\Anaconda3\lib\site-packages\pyspark\sql\dataframe.py in take(self, num)
502 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
503 """
--> 504 return self.limit(num).collect()
505 
506 @since(1.3)

~\Anaconda3\lib\site-packages\pyspark\sql\dataframe.py in collect(self)
464 """
465 with SCCallSiteSync(self._sc) as css:
--> 466 sock_info = self._jdf.collectToPython()
467 return list(_load_from_socket(sock_info,
BatchedSerializer(PickleSerializer(
468 

~\Anaconda3\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
   1255 answer = self.gateway_client.send_command(command)
   1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259 for temp_arg in temp_args:

~\Anaconda3\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

~\Anaconda3\lib\site-packages\py4j\protocol.py in get_return_value(answer,
gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(

Py4JJavaError: An error occurred while calling o5404.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 26
in stage 190.0 failed 1 times, most recent failure: Lost task 26.0 in stage
190.0 (TID 4856, localhost, executor driver):
org.apache.spark.api.python.PythonException: Traceback (most recent call
last):
  File
"C:\Users\tryck\Anaconda3\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py",
line 253, in main
  File
"C:\Users\tryck\Anaconda3\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py",
line 248, in process
  File
"C:\Users\tryck\Anaconda3\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\serializers.py",
line 331, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File
"C:\Users\tryck\Anaconda3\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\serializers.py",
line 140, in dump_stream
for obj in iterator:
  File
"C:\Users\tryck\Anaconda3\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\serializers.py",
line 320, in _batched
for item in iterator:
  File "", line 1, in 
  File
"C:\Users\tryck\Anaconda3\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py",
line 76, in 
  File
"C:\Users\tryck\Anaconda3\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\util.py",
line 55, in wrapper
return f(*args, **kwargs)
  File
"C:\Users\tryck\Anaconda3\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py",
line 68, in 
  File "", line 1, in 
  File "C:\Users\tryck\Anaconda3\lib\re.py", line 191, in sub
return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object

at
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:330)
at
org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:83)
at

Why does Apache Spark Master shutdown when Zookeeper expires the session

2019-03-05 Thread lokeshkumar
As I understand, the Apache Spark Master can be run in high availability mode
using Zookeeper. That is, multiple Spark masters can run in Leader/Follower
mode, and these masters are registered with Zookeeper.

In our scenario Zookeeper is expiring the session of the Spark Master which is
acting as Leader. So the Spark Master which is the leader receives this
notification and shuts down deliberately.

Can someone explain why this decision of shutting down rather than retrying
was taken?

And why does Kafka retry connecting to Zookeeper when it receives the same
expiry notification?



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is there a way to validate the syntax of raw spark sql query?

2019-03-05 Thread Akshay Bhardwaj
Hi Kant,

You can try "explaining" the sql query.

spark.sql(sqlText).explain(true); //the parameter true is to get a more
verbose query plan and it is optional.


This is the safest way to validate SQL without actually executing/creating
a df/view in Spark. It validates the syntax as well as the schema of the
tables/views used.
If there is an issue with your SQL syntax then the method throws the below
exception, which you can catch:

org.apache.spark.sql.catalyst.parser.ParseException
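
In Scala this can be caught directly; a minimal sketch (the query string here
is a made-up invalid example, the calls are the ones mentioned above):

import org.apache.spark.sql.catalyst.parser.ParseException

val sqlText = "select * morf table" // intentionally invalid
try {
  // Parsing and analysis happen here; nothing is executed.
  spark.sql(sqlText).explain(true)
  // If we get here, the syntax and the referenced tables/views check out.
} catch {
  case pe: ParseException =>
    println(s"Invalid SQL syntax: ${pe.getMessage}")
}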


Hope this helps!



Akshay Bhardwaj
+91-97111-33849


On Fri, Mar 1, 2019 at 10:23 PM kant kodali  wrote:

> Hi All,
>
> Is there a way to validate the syntax of raw spark SQL query?
>
> For example, I would like to know if there is any isValid API call that
> Spark provides.
>
> val query = "select * from table"if(isValid(query)) {
> sparkSession.sql(query) } else {
> log.error("Invalid Syntax")}
>
> I tried the following
>
> val query = "select * morf table" // Invalid queryval parser = 
> spark.sessionState.sqlParsertry{
> parser.parseExpression(query)} catch (ParseException ex) {
> throw new Exception(ex); //Exception not getting thrown}Dataset<>Row df = 
> sparkSession.sql(query) // Exception gets thrown here
> df.writeStream.format("console").start()
>
> Question: parser.parseExpression is not catching the invalid syntax
> before I hit sparkSession.sql. In other words, it is not being helpful in
> the above code. Any reason? My whole goal is to catch syntax errors before
> I pass the query on to sparkSession.sql.
>
>
>


Re: How to add more imports at the start of REPL

2019-03-05 Thread Nuthan Reddy
Ok, thanks. It does solve the problem.

Nuthan Reddy
Sigmoid Analytics



On Tue, Mar 5, 2019 at 5:17 PM Jesús Vásquez 
wrote:

> Hi Nuthan, I have had the same issue before.
> As a shortcut I created a text file called imports and then loaded its
> content with the :load command of the Scala REPL.
> Example:
> Create "imports" textfile with the import instructions you need
>
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.Path
>
> Then load it with :load
>
> :load imports
>
> Once you have the text file with the import instructions, you just have to
> load its content each time you start the REPL.
> Regards.
>
>
> On Tue, Mar 5, 2019 at 12:14, Nuthan Reddy (
> nut...@sigmoidanalytics.com) wrote:
>
>> Hi,
>>
>> When launching the REPL using spark-submit, the following are loaded
>> automatically.
>>
>> scala> :imports
>>
>>  1) import org.apache.spark.SparkContext._ (70 terms, 1 are implicit)
>>
>>  2) import spark.implicits._   (1 types, 67 terms, 37 are implicit)
>>
>>  3) import spark.sql   (1 terms)
>>
>>  4) import org.apache.spark.sql.functions._ (385 terms)
>> And I would like to add more imports which I frequently use, to reduce the
>> typing that I do for these imports.
>> Can anyone suggest a way to do this?
>>
>> Nuthan Reddy
>> Sigmoid Analytics
>>
>>
>> *Disclaimer*: This is not a mass e-mail and my intention here is purely
>> from a business perspective, and not to spam or encroach your privacy. I am
>> writing with a specific agenda to build a personal business connection.
>> Being a reputed and genuine organization, Sigmoid respects the digital
>> security of every prospect and tries to comply with GDPR and other regional
>> laws. Please let us know if you feel otherwise and we will rectify the
>> misunderstanding and adhere to comply in the future. In case we have missed
>> any of the compliance, it is completely unintentional.
>>
>

-- 
*Disclaimer*: This is not a mass e-mail and my intention here is purely 
from a business perspective, and not to spam or encroach your privacy. I am 
writing with a specific agenda to build a personal business connection. 
Being a reputed and genuine organization, Sigmoid respects the digital 
security of every prospect and tries to comply with GDPR and other regional 
laws. Please let us know if you feel otherwise and we will rectify the 
misunderstanding and adhere to comply in the future. In case we have missed 
any of the compliance, it is completely unintentional.


Re: How to add more imports at the start of REPL

2019-03-05 Thread Jesús Vásquez
Hi Nuthan, I have had the same issue before.
As a shortcut I created a text file called imports and then loaded its
content with the :load command of the Scala REPL.
Example:
Create "imports" textfile with the import instructions you need

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

Then load it with :load

:load imports

Once you have the text file with the import instructions, you just have to
load its content each time you start the REPL.
Regards.


On Tue, Mar 5, 2019 at 12:14, Nuthan Reddy (
nut...@sigmoidanalytics.com) wrote:

> Hi,
>
> When launching the REPL using spark-submit, the following are loaded
> automatically.
>
> scala> :imports
>
>  1) import org.apache.spark.SparkContext._ (70 terms, 1 are implicit)
>
>  2) import spark.implicits._   (1 types, 67 terms, 37 are implicit)
>
>  3) import spark.sql   (1 terms)
>
>  4) import org.apache.spark.sql.functions._ (385 terms)
> And I would like to add more imports which I frequently use, to reduce the
> typing that I do for these imports.
> Can anyone suggest a way to do this?
>
> Nuthan Reddy
> Sigmoid Analytics
>
>
> *Disclaimer*: This is not a mass e-mail and my intention here is purely
> from a business perspective, and not to spam or encroach your privacy. I am
> writing with a specific agenda to build a personal business connection.
> Being a reputed and genuine organization, Sigmoid respects the digital
> security of every prospect and tries to comply with GDPR and other regional
> laws. Please let us know if you feel otherwise and we will rectify the
> misunderstanding and adhere to comply in the future. In case we have missed
> any of the compliance, it is completely unintentional.
>


C++ script on Spark Cluster throws exit status 132

2019-03-05 Thread Mkal
I'm trying to run a C++ program on a Spark cluster using the rdd.pipe()
operation, but the executors throw: java.lang.IllegalStateException:
Subprocess exited with status 132.

The Spark jar runs totally fine standalone, and the C++ program runs just
fine on its own as well. I've tried another simple C++ script and there is
no problem with it running on the cluster.

As I understand it, exit status 132 means Illegal Instruction (SIGILL), but I
don't know how to use this to pinpoint the source of this error.

I get no further info by checking the executor logs. I'm posting this here
hoping that someone has a suggestion. I've tried other forums but no luck
yet.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



How to add more imports at the start of REPL

2019-03-05 Thread Nuthan Reddy
Hi,

When launching the REPL using spark-submit, the following are loaded
automatically.

scala> :imports

 1) import org.apache.spark.SparkContext._ (70 terms, 1 are implicit)

 2) import spark.implicits._   (1 types, 67 terms, 37 are implicit)

 3) import spark.sql   (1 terms)

 4) import org.apache.spark.sql.functions._ (385 terms)
And I would like to add more imports which I frequently use, to reduce the
typing that I do for these imports.
Can anyone suggest a way to do this?

Nuthan Reddy
Sigmoid Analytics

-- 
*Disclaimer*: This is not a mass e-mail and my intention here is purely 
from a business perspective, and not to spam or encroach your privacy. I am 
writing with a specific agenda to build a personal business connection. 
Being a reputed and genuine organization, Sigmoid respects the digital 
security of every prospect and tries to comply with GDPR and other regional 
laws. Please let us know if you feel otherwise and we will rectify the 
misunderstanding and adhere to comply in the future. In case we have missed 
any of the compliance, it is completely unintentional.


[no subject]

2019-03-05 Thread Shyam P
Hi All,
  I need to save a huge data frame as a parquet file. As it is huge, it is
taking several hours. To improve performance, I understand I have to write it
group-wise.

But when I do partition(columns*)/groupBy(columns*), the driver is spilling
a lot of data and performance suffers badly again.

So how do I handle this situation and save one group after another?
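
A rough sketch of one way to do that (it assumes a year column as in the
linked question, and that year is an integer; the output path is hypothetical
and purely illustrative):

import org.apache.spark.sql.functions.col

// Collect the distinct group keys (assumed to be few), then filter and write
// one group per job instead of partitioning the whole frame in a single job.
// Caching df first avoids recomputing it for every group.
df.cache()
val years = df.select("year").distinct().collect().map(_.getInt(0))

years.foreach { y =>
  df.filter(col("year") === y)
    .write
    .mode("append")
    .parquet(s"/path/to/output/year=$y")
}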

Attaching the sample scenario of the same.

https://stackoverflow.com/questions/54416623/how-to-group-dataframe-year-wise-and-iterate-through-groups-and-send-each-year-d

Highly appreciate your help.

Thanks,
Shyam