Re: Quick one... AWS SDK version?

2017-10-07 Thread Jean Georges Perrin

Hey Marco,

I am actually reading from S3, and I use Hadoop 2.7.3, but I inherited the 
project and it uses some AWS APIs from the Amazon SDK whose version is like 
from yesterday :) so things get confusing, and Amazon changes its versions so 
fast that it is hard to keep up. Right now I have gone back to Hadoop 2.7.3 
and AWS SDK 1.7.4...

jg


> On Oct 7, 2017, at 15:34, Marco Mistroni  wrote:
> 
> Hi JG
> out of curiosity, what's your use case? Are you writing to S3? You could use 
> Spark to do that, e.g. using the Hadoop package 
> org.apache.hadoop:hadoop-aws:2.7.1; that will pull in the AWS client that is 
> in line with Hadoop 2.7.1.
> 
> hth
>  marco
> 
>> On Fri, Oct 6, 2017 at 10:58 PM, Jonathan Kelly  
>> wrote:
>> Note: EMR builds Hadoop, Spark, et al, from source against specific versions 
>> of certain packages like the AWS Java SDK, httpclient/core, Jackson, etc., 
>> sometimes requiring some patches in these applications in order to work with 
>> versions of these dependencies that differ from what the applications may 
>> support upstream.
>> 
>> For emr-5.8.0, we have built Hadoop and Spark (the Spark Kinesis connector, 
>> that is, since that's the only part of Spark that actually depends upon the 
>> AWS Java SDK directly) against AWS Java SDK 1.11.160 instead of the much 
>> older version that vanilla Hadoop 2.7.3 would otherwise depend upon.
>> 
>> ~ Jonathan
>> 
>>> On Wed, Oct 4, 2017 at 7:17 AM Steve Loughran  
>>> wrote:
 On 3 Oct 2017, at 21:37, JG Perrin  wrote:
 
 Sorry Steve – I may not have been very clear: thinking about 
 aws-java-sdk-z.yy.xxx.jar. To the best of my knowledge, none is bundled 
 with Spark.
>>> 
>>> 
>>> I know, but if you are talking to S3 via the s3a client, you will need the 
>>> AWS SDK version to match the one expected by the hadoop-aws JAR, which in 
>>> turn must match the Hadoop version of your other JARs. Similarly, if you 
>>> are using spark-kinesis, it needs to be in sync there. 
  
 From: Steve Loughran [mailto:ste...@hortonworks.com] 
 Sent: Tuesday, October 03, 2017 2:20 PM
 To: JG Perrin 
 Cc: user@spark.apache.org
 Subject: Re: Quick one... AWS SDK version?
  
  
 On 3 Oct 2017, at 02:28, JG Perrin  wrote:
  
 Hey Sparkians,
  
 What version of AWS Java SDK do you use with Spark 2.2? Do you stick with 
 the Hadoop 2.7.3 libs?
  
 You generally have to stick with the version that Hadoop was built with, I'm 
 afraid... it's a very brittle dependency. 
> 
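
To make the version-matching advice above concrete, here is a minimal build 
sketch, assuming Spark 2.2 running against Hadoop 2.7.x (hadoop-aws 2.7.x was 
compiled against aws-java-sdk 1.7.4); the artifact choices are illustrative, 
not prescriptive:

// build.sbt sketch: pin hadoop-aws and the AWS SDK to the pair Hadoop 2.7.x expects
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-sql"    % "2.2.0" % "provided",
  "org.apache.hadoop" %  "hadoop-aws"   % "2.7.3",
  "com.amazonaws"     %  "aws-java-sdk" % "1.7.4"  // version hadoop-aws 2.7.x was built with
)

Alternatively, as Marco suggests, passing --packages org.apache.hadoop:hadoop-aws:2.7.3 
to spark-submit lets the matching aws-java-sdk come in as a transitive dependency.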



Cases when to clear the checkpoint directories.

2017-10-07 Thread John, Vishal (Agoda)


Hello TD,

You had replied to one of the questions about checkpointing –

This is an unfortunate design on my part when I was building DStreams :)

Fortunately, we learnt from our mistakes and built Structured Streaming the 
correct way. Checkpointing in Structured Streaming stores only the progress 
information (offsets, etc.), and the user can change their application code 
(within certain constraints, of course) and still restart from checkpoints 
(unlike DStreams). If you are just building out your streaming applications, 
then I highly recommend you to try out Structured Streaming instead of DStreams 
(which is effectively in maintenance mode).
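
(For reference, a minimal Structured Streaming sketch; the Kafka source, topic 
name, and console sink are assumptions for illustration only. As described 
above, the checkpoint directory configured here stores only progress 
information such as offsets, not serialized application code.)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()
import spark.implicits._

// Assumed Kafka source; any streaming source is configured the same way
val lines = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // hypothetical broker
  .option("subscribe", "events")                       // hypothetical topic
  .load()
  .selectExpr("CAST(value AS STRING) AS value")
  .as[String]

// The query logic can evolve between restarts, within the documented constraints
val counts = lines.groupBy("value").count()

counts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/demo") // offsets/progress only
  .start()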

Can you please elaborate on what you mean by an application code change in 
DStream applications?

If I add a couple of println statements to my application code, does that 
count as an application code change? Or do you mean changing method 
signatures, adding new methods, etc.?
Could you please point to the relevant source code in Spark that does this 
kind of code validation/deserialisation in the case of DStreams?

We are using mapWithState in our application and it builds its state from 
checkpointed RDDs. I would like to understand the cases in which we can avoid 
clearing the checkpoint directories.


thanks in advance,
Vishal





Re: DataFrame multiple agg on the same column

2017-10-07 Thread yohann jardin
Hey Somasundaram,

Using a map is only one way to call agg. For the complete list of overloads, see: 
https://spark.apache.org/docs/1.5.2/api/java/org/apache/spark/sql/GroupedData.html

Using the first one, agg(Column expr, Column... exprs):

grouped_txn.agg(count(lit(1)), sum('amount), max('amount), min('created_time), 
max('created_time)).show

Yohann Jardin
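
For completeness, a self-contained sketch (made-up data and column names 
modeled on the thread) of the Column-based agg with multiple aggregates on the 
same column:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("multi-agg").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical transactions with the column names used in the question
val grouped_txn = Seq(
  ("u1", 10.0, 1507200000L),
  ("u1", 25.0, 1507300000L),
  ("u2",  5.0, 1507100000L)
).toDF("user", "amount", "created_time")
  .groupBy("user")

grouped_txn.agg(
  count(lit(1)).as("txn_count"),
  sum("amount").as("total_amount"),
  min("amount").as("min_amount"),
  max("amount").as("max_amount"),
  min("created_time").as("first_txn"),
  max("created_time").as("last_txn")
).show()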

On 10/7/2017 at 7:12 PM, Somasundaram Sekar wrote:
Hi,

I have a GroupedData object on which I perform aggregations of a few columns. 
Since GroupedData's agg takes in a map, I cannot perform multiple aggregates 
on the same column, say if I want both the max and the min of amount.

So the line of code below will return only one aggregate per column:

grouped_txn.agg({'*' : 'count', 'amount' : 'sum', 'amount' : 'max', 
'created_time' : 'min', 'created_time' : 'max'})

What are the possible alternatives? I could define a new column that is just a 
copy of the original and use that, but that looks ugly. Any suggestions?

Thanks,
Somasundaram S





Re: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

2017-10-07 Thread Jules Damji
You might find these blogs helpful to parse & extract data from complex 
structures:

https://databricks.com/blog/2017/06/27/4-sql-high-order-lambda-functions-examine-complex-structured-data-databricks.html

https://databricks.com/blog/2017/06/13/five-spark-sql-utility-functions-extract-explore-complex-data-types.html

Cheers 
Jules


Sent from my iPhone
Pardon the dumb thumb typos :)

> On Oct 7, 2017, at 12:30 AM, kant kodali  wrote:
> 
> I have a Dataset ds which consists of json rows.
> 
> Sample Json Row (This is just an example of one row in the dataset)
> 
> [ 
> {"name": "foo", "address": {"state": "CA", "country": "USA"}, 
> "docs":[{"subject": "english", "year": 2016}]}
> {"name": "bar", "address": {"state": "OH", "country": "USA"}, 
> "docs":[{"subject": "math", "year": 2017}]}
> 
> ]
> ds.printSchema()
> 
> root
>  |-- value: string (nullable = true)
> Now I want to convert into the following dataset using Spark 2.2.0
> 
> name  | address   |  docs 
> --
> "foo" | {"state": "CA", "country": "USA"} | [{"subject": "english", "year": 
> 2016}]
> "bar" | {"state": "OH", "country": "USA"} | [{"subject": "math", "year": 
> 2017}]
> Preferably Java but Scala is also fine as long as there are functions 
> available in Java API


[spark-core] SortShuffleManager - when to enable Serialized sorting

2017-10-07 Thread Weitong Chen
Hi,

   Why does canUseSerializedShuffle check dependency.aggregator rather than
dependency.mapSideCombine?
   In SortShuffleWriter (the BaseShuffleHandle path), dep.mapSideCombine
decides whether dep.aggregator is passed to the sorter or not.
  canUseSerializedShuffle:

  /**
   * Helper method for determining whether a shuffle should use an optimized
   * serialized shuffle path or whether it should fall back to the original
   * path that operates on deserialized objects.
   */
  def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
    val shufId = dependency.shuffleId
    val numPartitions = dependency.partitioner.numPartitions
    if (!dependency.serializer.supportsRelocationOfSerializedObjects) {
      log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " +
        s"${dependency.serializer.getClass.getName}, does not support object relocation")
      false
    } else if (dependency.aggregator.isDefined) {
      log.debug(
        s"Can't use serialized shuffle for shuffle $shufId because an aggregator is defined")
      false
    } else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
      log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " +
        s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions")
      false
    } else {
      log.debug(s"Can use serialized shuffle for shuffle $shufId")
      true
    }
  }
 
  SortShuffleWriter:

  private[spark] class SortShuffleWriter[K, V, C](
    ...
    /** Write a bunch of records to this task's output */
    override def write(records: Iterator[Product2[K, V]]): Unit = {
      sorter = if (dep.mapSideCombine) {
        require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
        new ExternalSorter[K, V, C](
          context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
      } else {
        // In this case we pass neither an aggregator nor an ordering to the sorter,
        // because we don't care whether the keys get sorted in each partition;
        // that will be done on the reduce side if the operation being run is sortByKey.
        new ExternalSorter[K, V, V](
          context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
      }
      sorter.insertAll(records)
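
(Illustrative aside, not from the original mail: a small sketch, using the
standard RDD API, of the two cases discussed above. groupByKey defines an
aggregator on its shuffle dependency while leaving mapSideCombine = false,
whereas reduceByKey sets both, so the two flags are not interchangeable.)

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("shuffle-flags"))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

pairs.groupByKey().collect()       // aggregator defined, mapSideCombine = false
pairs.reduceByKey(_ + _).collect() // aggregator defined, mapSideCombine = true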

   






Re: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

2017-10-07 Thread Matteo Cossu
Hello,
I think you should use from_json from spark.sql.functions to parse the JSON 
string and convert it to a StructType. Afterwards, you can create a new 
Dataset by selecting the columns you want.
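
A minimal sketch of that approach (assuming ds is the Dataset from the 
question with its single "value" column, and a schema hand-written to match 
the sample rows):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Schema mirroring the sample rows in the question
val schema = new StructType()
  .add("name", StringType)
  .add("address", new StructType()
    .add("state", StringType)
    .add("country", StringType))
  .add("docs", ArrayType(new StructType()
    .add("subject", StringType)
    .add("year", IntegerType)))

// Parse the JSON string column and keep only the wanted fields
val parsed = ds
  .select(from_json(col("value"), schema).as("j"))
  .select("j.name", "j.address", "j.docs")

parsed.printSchema()
parsed.show(false)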

On 7 October 2017 at 09:30, kant kodali  wrote:

> I have a Dataset ds which consists of json rows.
>
> *Sample Json Row (This is just an example of one row in the dataset)*
>
> [
> {"name": "foo", "address": {"state": "CA", "country": "USA"}, 
> "docs":[{"subject": "english", "year": 2016}]}
> {"name": "bar", "address": {"state": "OH", "country": "USA"}, 
> "docs":[{"subject": "math", "year": 2017}]}
>
> ]
>
> ds.printSchema()
>
> root
>  |-- value: string (nullable = true)
>
> Now I want to convert into the following dataset using Spark 2.2.0
>
> name  | address   |  docs
> --
> "foo" | {"state": "CA", "country": "USA"} | [{"subject": "english", "year": 
> 2016}]
> "bar" | {"state": "OH", "country": "USA"} | [{"subject": "math", "year": 
> 2017}]
>
> Preferably Java but Scala is also fine as long as there are functions
> available in Java API
>

