Re: Apache Spark - Custom structured streaming data source

2018-01-25 Thread Tathagata Das
Hello Mans,

The streaming DataSource APIs are still evolving and are not public yet.
Hence there is no official documentation. In fact, there is a new
DataSourceV2 API (in Spark 2.3) that we are migrating towards. So at this
point in time, it's hard to make any concrete suggestions. You can take a
look at the classes DataSourceV2, DataReader, and MicroBatchDataReader in the
Spark source code, along with their implementations.

Hope this helps.

TD

On Jan 25, 2018 8:36 PM, "M Singh"  wrote:

Hi:

I am trying to create a custom structured streaming source and would like
to know if there is any example or documentation on the steps involved.

I've looked at some of the methods available in SparkSession, but these are
internal to the sql package:

  private[sql] def internalCreateDataFrame(
      catalystRows: RDD[InternalRow],
      schema: StructType,
      isStreaming: Boolean = false): DataFrame = {
    // TODO: use MutableProjection when rowRDD is another DataFrame and the applied
    // schema differs from the existing schema on any field data type.
    val logicalPlan = LogicalRDD(
      schema.toAttributes,
      catalystRows,
      isStreaming = isStreaming)(self)
    Dataset.ofRows(self, logicalPlan)
  }

Please let me know where I can find the appropriate API or documentation.

Thanks

Mans


Re: how to create a DataType Object using the String representation in Java using Spark 2.2.0?

2018-01-25 Thread Kurt Fehlhauer
Can you share your code and a sample of your data? Without seeing it, I
can't give a definitive answer, but I can offer some hints. If you have a
column of strings, you should be able to create a new column cast to
Integer. This can be accomplished in two ways:

df.withColumn("newColumn", df("currentColumn").cast(IntegerType))

or

val df2 = df.selectExpr("cast(currentColumn as int) as newColumn")


Without seeing your json, I really can't offer assistance.
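
That said, if what you actually need is the DataType object rather than its
name, here is a rough, untested sketch (assuming your DataFrame is called
df): the schema already carries real DataType objects, so there is no need
to parse the "IntegerType" string at all.

import org.apache.spark.sql.types._

// df.schema is a StructType whose fields hold actual DataType objects.
val sumType: DataType = df.schema("sum").dataType        // e.g. IntegerType

// If you must round-trip through a string, DataType.fromJson expects the
// JSON form produced by DataType.json (e.g. "integer"), not the toString
// name "IntegerType".
val roundTrip: DataType = DataType.fromJson(IntegerType.json)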


On Thu, Jan 25, 2018 at 11:39 PM, kant kodali  wrote:

> It seems like it's hard to construct a DataType given its String literal
> representation.
>
> dataframe.dtypes() returns column names and their corresponding types. For
> example, say I have an integer column named "sum": doing dataframe.dtypes()
> would return "sum" and "IntegerType", but this string representation
> "IntegerType" doesn't seem to be very useful, because I cannot do
> DataType.fromJson("IntegerType") -- that will throw an error. So I am not
> quite sure how to construct a DataType given its String representation.
>
> On Thu, Jan 25, 2018 at 4:22 PM, kant kodali  wrote:
>
>> Hi All,
>>
>> I have a datatype "IntegerType" represented as a String, and now I want to
>> create a DataType object out of that. I couldn't find anything in the
>> DataType or DataTypes API on how to do that.
>>
>> Thanks!
>>
>
>


Re: how to create a DataType Object using the String representation in Java using Spark 2.2.0?

2018-01-25 Thread kant kodali
It seems like it's hard to construct a DataType given its String literal
representation.

dataframe.dtypes() returns column names and their corresponding types. For
example, say I have an integer column named "sum": doing dataframe.dtypes()
would return "sum" and "IntegerType", but this string representation
"IntegerType" doesn't seem to be very useful, because I cannot do
DataType.fromJson("IntegerType") -- that will throw an error. So I am not
quite sure how to construct a DataType given its String representation.

On Thu, Jan 25, 2018 at 4:22 PM, kant kodali  wrote:

> Hi All,
>
> I have a datatype "IntegerType" represented as a String, and now I want to
> create a DataType object out of that. I couldn't find anything in the
> DataType or DataTypes API on how to do that.
>
> Thanks!
>


Apache Spark - Custom structured streaming data source

2018-01-25 Thread M Singh
Hi:
I am trying to create a custom structured streaming source and would like to 
know if there is any example or documentation on the steps involved.
I've looked at some of the methods available in SparkSession, but these are
internal to the sql package:
  private[sql] def internalCreateDataFrame(
      catalystRows: RDD[InternalRow],
      schema: StructType,
      isStreaming: Boolean = false): DataFrame = {
    // TODO: use MutableProjection when rowRDD is another DataFrame and the applied
    // schema differs from the existing schema on any field data type.
    val logicalPlan = LogicalRDD(
      schema.toAttributes,
      catalystRows,
      isStreaming = isStreaming)(self)
    Dataset.ofRows(self, logicalPlan)
  }
Please let me know where I can find the appropriate API or documentation.
Thanks
Mans

Spark Standalone Mode, application runs, but executor is killed

2018-01-25 Thread Chandu
Hi,
I posted my question on stackoverflow.com (
https://stackoverflow.com/questions/48445145/spark-standalone-mode-application-runs-but-executor-is-killed-with-exitstatus),
but it is yet to be answered, so I thought I would try the user group.

I am new to Apache Spark and was trying to run the example Pi calculation
application on my local Spark setup (using a standalone cluster). The
master, the worker, and the driver are all running on my local machine.

What I am noticing is that the Pi value is calculated successfully; however,
in the slave logs I see that the worker/executor is being killed with
exitStatus 1. I do not see any errors/exceptions logged to the console
otherwise. I tried finding help on similar issues, but most of the search
hits were referring to exitStatus 137 etc. (e.g. "Spark application kills
executor",
https://stackoverflow.com/questions/40910952/spark-application-kills-executor).

I have failed miserably to understand why the worker is being killed
instead of completing the execution in the 'EXITED' state. I think it's
related to how I am executing the app, but I am not quite clear on what I
am doing wrong.

The code and logs are available at
https://gist.github.com/Chandu/a83c13c045f1d1b480d8839e145b2749
(trying to keep the email content short).

I wanted to understand whether my assumption is correct that an executor
should end in the EXITED state when there are no errors in execution, or
whether it is always set to KILLED when a Spark job completes.

I tried to understand the flow by looking at the source code, and with my
limited understanding of it, I found that the executor would always end up
with KILLED status (most likely my conclusion is wrong), based on the code at
https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala#L118

Can someone guide me on identifying the root cause of this issue, or tell me
if my assumption that the executor should have a status of EXITED at the end
of execution is not correct?







how to create a DataType Object using the String representation in Java using Spark 2.2.0?

2018-01-25 Thread kant kodali
Hi All,

I have a datatype "IntegerType" represented as a String, and now I want to
create a DataType object out of that. I couldn't find anything in the
DataType or DataTypes API on how to do that.

Thanks!


Re: Get broadcast (set in one method) in another method

2018-01-25 Thread Gourav Sengupta
Hi,

Just out of curiosity, what sort of programming or design paradigm does
this way of solving things fit into? If you are aiming for functional
programming, do you think currying would help?
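
For what it's worth, here is a rough, untested sketch of the two usual ways
people make a broadcast visible across methods: hold the handle in a field
of the object, or pass it around explicitly as a parameter. The Map used
here is just a placeholder for whatever you are broadcasting.

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.SparkSession

object A {
  // Option 1: keep the broadcast handle in a field that y() sets once.
  private var bcA: Broadcast[Map[String, Int]] = _

  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder().getOrCreate().sparkContext
    y(sc)
    x()
  }

  def y(sc: SparkContext): Unit = {
    bcA = sc.broadcast(Map("a" -> 1))      // set the broadcast here
  }

  def x(): Unit = {
    val a = bcA.value                      // read it here, after y() has run
    println(a)
  }

  // Option 2, closer to a functional style: pass the handle explicitly.
  def z(bc: Broadcast[Map[String, Int]]): Int = bc.value.getOrElse("a", 0)
}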

Regards,
Gourav Sengupta


On Thu, Jan 25, 2018 at 8:04 PM, Margusja  wrote:

> Hi
>
> Maybe I am overthinking this. I’d like to set a broadcast variable in
> method y of object A and read it in method x.
>
> For example:
>
> object A {
>
> def main (args: Array[String]) {
> y()
> x()
> }
>
> def x() : Unit = {
> val a = bcA.value
> ...
> }
>
> def y(): String = {
> val bcA = sc.broadcast(a)
> …
> return "String value"
> }
>
> }
>
>
>
>
> ---
> Br
> Margus
>


Get broadcast (set in one method) in another method

2018-01-25 Thread Margusja
Hi

Maybe I am overthinking this. I’d like to set a broadcast variable in method y
of object A and read it in method x.

For example:

object A {

def main (args: Array[String]) {
y()
x()
}

def x() : Unit = {
val a = bcA.value
...
}

def y(): String = {
val bcA = sc.broadcast(a) 
…
return "String value"
}

}




---
Br
Margus

Custom build - missing images on MasterWebUI

2018-01-25 Thread Conconscious
Hi list,

I'm trying to make a custom build of Spark, but in the end there are no
images on the Web UI.

Some help, please.

Build from:

git checkout v2.2.1

./dev/make-distribution.sh --name custom-spark --pip --tgz -Psparkr
-Phadoop-2.7 -Dhadoop.version=2.7.3 -Phive -Phive-thriftserver -Pmesos
-Pyarn -DskipTests clean package

Thanks in advance



Re: Apache Hadoop and Spark

2018-01-25 Thread jamison.bennett
Hi Mutahir,

I will try to answer some of your questions.

Q1) Can we use MapReduce and Apache Spark in the same cluster?
Yes. I run a cluster with both MapReduce2 and Spark, and I use YARN as the
resource manager.

Q2) Is it mandatory to use GPUs for Apache Spark?
No. My cluster has Spark and does not have any GPUs.

Q3) I read that Apache Spark is in-memory; will it benefit from SSD / flash
for caching or persistent storage?
As you noted, Spark is in-memory, but there are a few places where faster
storage may help, including:
- Storage of the file data read into Spark from the HDFS DataNodes
- RDD persistence, when caching includes one of the disk options
- The Spark shuffle service - between Spark stages, which process the data
in-memory, intermediate results from Spark executors are written to storage
and served to the next stage by the shuffle service.
I don't have any benchmark results for these, but it might be something you
want to look into.
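
To make the RDD persistence point concrete, here is a minimal sketch
(assuming a running SparkSession named spark; the input path is just a
placeholder). With a disk-backed storage level, blocks that do not fit in
memory spill to local disk, which is where SSD / flash would help:

import org.apache.spark.storage.StorageLevel

val cached = spark.sparkContext
  .textFile("hdfs:///data/events")          // placeholder input path
  .persist(StorageLevel.MEMORY_AND_DISK)    // spills to local disk when memory is full

cached.count()                              // materializes the cache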

Thanks,
Jamison




Re: S3 token times out during data frame "write.csv"

2018-01-25 Thread Jean Georges Perrin
Are you writing to S3 from an Amazon instance or from an on-premise install?
How many partitions are you writing from? Maybe you can try to “play” with
repartitioning to see how it behaves.
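
For example, a minimal sketch of repartitioning before the write (the
partition count and the bucket path are just placeholders):

df.repartition(200)                    // tune the number of output files
  .write
  .csv("s3a://my-bucket/output/")      // placeholder destination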

> On Jan 23, 2018, at 17:09, Vasyl Harasymiv  wrote:
> 
> It is about 400 million rows. S3 automatically chunks the file on their end
> while writing, so that's fine; e.g. it creates the same file name with
> alphanumeric suffixes.
> However, the write session expires due to token expiration. 
> 
> On Tue, Jan 23, 2018 at 5:03 PM, Jörn Franke wrote:
>  How large is the file?
> 
> If it is very large then you should have anyway several partitions for the 
> output. This is also important in case you need to read again from S3 - 
> having several files there enables parallel reading.
> 
> On 23. Jan 2018, at 23:58, Vasyl Harasymiv wrote:
> 
>> Hi Spark Community,
>> 
>> Saving a data frame into a file on S3 using:
>> 
>> df.write.csv(s3_location)
>> 
>> If run for longer than 30 mins, the following error persists:
>> 
>> The provided token has expired. (Service: Amazon S3; Status Code: 400; Error 
>> Code: ExpiredToken;)
>> 
>> Potentially because there is a hardcoded session limit on the temporary S3
>> connection from Spark.
>> 
>> One can specify the duration as per here:
>> 
>> https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html
>> 
>> One can, of course, chunk the data into sub-30-minute writes. However, is
>> there a way to change the token expiry parameter directly in Spark before
>> using "write.csv"?
>> 
>> Thanks a lot for any help!
>> Vasyl
>> 
>> On Tue, Jan 23, 2018 at 2:46 PM, Toy wrote:
>> Thanks, I get this error when I switched to s3a://
>> 
>> Exception in thread "streaming-job-executor-0" java.lang.NoSuchMethodError: 
>> com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
>>     at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:287)
>>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
>> 
>> On Tue, 23 Jan 2018 at 15:05, Patrick Alwell wrote:
>> Spark cannot read locally from S3 without an S3a protocol; you'll more than
>> likely need a local copy of the data or you'll need to utilize the proper
>> jars to enable S3 communication from the edge to the datacenter.
>> 
>> https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
>> 
>> Here are the jars:
>> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
>> 
>> Looks like you already have them, in which case you'll have to make small
>> configuration changes, e.g. s3:// -> s3a://
>> 
>> Keep in mind: the Amazon JARs have proven very brittle: the version of the
>> Amazon libraries must match the versions against which the Hadoop binaries
>> were built.
>> 
>> https://hortonworks.github.io/hdp-aws/s3-s3aclient/index.html#using-the-s3a-filesystem-client
>> 
>> From: Toy
>> Date: Tuesday, January 23, 2018 at 11:33 AM
>> To: "user@spark.apache.org"
>> Subject: I can't save DataFrame from running Spark locally
>> 
>> Hi,
>> 
>> First of all, my Spark application runs fine in AWS EMR. However, I'm trying
>> to run it locally to debug an issue. My application just parses log files,
>> converts them to a DataFrame, then converts to ORC and saves to S3. However,
>> when I run locally I get this error:
>> 
>> java.io.IOException: /orc/dt=2018-01-23 doesn't exist
>>     at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.get(Jets3tFileSystemStore.java:170)
>>     at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.retrieveINode(Jets3tFileSystemStore.java:221)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>     at java.lang.reflect.Method.invoke(Method.java:497)
>>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)

Kafka deserialization to Structured Streaming SQL - Encoders.bean result doesn't match itself?

2018-01-25 Thread Iain Cundy
Hi All

I'm trying to move from MapWithState to Structured Streaming v2.2.1, but I've 
run into a problem. 

To convert from Kafka data with a binary (protobuf) value to SQL I'm taking the 
dataset from readStream and doing 

Dataset<Row> s = dataset.selectExpr("timestamp", "CAST(key as string)",
    "ETBinnedDeserialize(value) AS message");

ETBinnedDeserialize is a UDF

spark.udf().register("ETBinnedDeserialize",
(UDF1) ETProtobufDecoder::deserialize, 
Encoders.bean(BinnedForET.class).schema());

ETProtobufDecoder::deserialize looks like this

public static Object deserialize(byte[] bytes) {
    ExpressionEncoder<BinnedForET> expressionEncoder =
        (ExpressionEncoder<BinnedForET>) Encoders.bean(BinnedForET.class);
    BinnedForET binned =  // Convert binary to pojo
    InternalRow row = expressionEncoder.toRow(binned);
    return row;
}

The key point from all this is that the schema for message comes from
Encoders.bean(BinnedForET.class), and the object the UDF returns is the result
of the same encoder's toRow method.
Yet there is a scala.MatchError. So if not toRow, what should I be calling?

Here is the error:
org.apache.spark.SparkException: Failed to execute user defined 
function($anonfun$27: (binary) => 
struct)

Caused by: scala.MatchError: [0,16,c,0,0,1,576c,7e2] (of class 
org.apache.spark.sql.catalyst.expressions.UnsafeRow)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:236)

I've reduced the class to 7 ints to make it as simple as possible, so there
are no Strings to make the struct more complicated.

The Scala object above seems to be a struct with an initial zero (a null
indicator?) followed by the 7 ints I expect, in hex, but it doesn't match.
Maybe it's obviously wrong to a Scala programmer?

Any ideas what I should be calling instead of (or after?) toRow to return the 
right thing?

Cheers
Iain 


