spark hive branch location

2015-10-05 Thread weoccc
Hi,

I would like to know the GitHub location of the Spark Hive fork that the
Spark build depends on. I was told it used to be at
https://github.com/pwendell/hive but it seems it is no longer there.

Thanks a lot,

Weide


Re: spark hive branch location

2015-10-05 Thread Michael Armbrust
I think this is the most up to date branch (used in Spark 1.5):
https://github.com/pwendell/hive/tree/release-1.2.1-spark

On Mon, Oct 5, 2015 at 1:03 PM, weoccc  wrote:

> Hi,
>
> I would like to know where is the spark hive github location where spark
> build depend on ? I was told it used to be here
> https://github.com/pwendell/hive but it seems it is no longer there.
>
> Thanks a lot,
>
> Weide
>


Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-05 Thread Patrick Wendell
The missing artifacts are uploaded now. Things should propagate in the next
24 hours. If there are still issues after that, ping this thread. Thanks!

- Patrick

On Mon, Oct 5, 2015 at 2:41 PM, Nicholas Chammas  wrote:

> Thanks for looking into this Josh.
>
> On Mon, Oct 5, 2015 at 5:39 PM Josh Rosen 
> wrote:
>
>> I'm working on a fix for this right now. I'm planning to re-run a
>> modified copy of the release packaging scripts which will emit only the
>> missing artifacts (so we won't upload new artifacts with different SHAs for
>> the builds which *did* succeed).
>>
>> I expect to have this finished in the next day or so; I'm currently
>> blocked by some infra downtime but expect that to be resolved soon.
>>
>> - Josh
>>
>> On Mon, Oct 5, 2015 at 8:46 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Blaž said:
>>>
>>> Also missing is
>>> http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
>>> which breaks spark-ec2 script.
>>>
>>> This is the package I am referring to in my original email.
>>>
>>> Nick said:
>>>
>>> It appears that almost every version of Spark up to and including 1.5.0
>>> has included a —bin-hadoop1.tgz release (e.g. spark-1.5.0-bin-hadoop1.tgz).
>>> However, 1.5.1 has no such package.
>>>
>>> Nick
>>> ​
>>>
>>> On Mon, Oct 5, 2015 at 3:27 AM Blaž Šnuderl  wrote:
>>>
 Also missing is http://s3.amazonaws.com/spark-related-packages/spark-
 1.5.1-bin-hadoop1.tgz which breaks spark-ec2 script.

 On Mon, Oct 5, 2015 at 5:20 AM, Ted Yu  wrote:

> hadoop1 package for Scala 2.10 wasn't in RC1 either:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>
> On Sun, Oct 4, 2015 at 5:17 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I’m looking here:
>>
>> https://s3.amazonaws.com/spark-related-packages/
>>
>> I believe this is where one set of official packages is published.
>> Please correct me if this is not the case.
>>
>> It appears that almost every version of Spark up to and including
>> 1.5.0 has included a --bin-hadoop1.tgz release (e.g.
>> spark-1.5.0-bin-hadoop1.tgz).
>>
>> However, 1.5.1 has no such package. There is a
>> spark-1.5.1-bin-hadoop1-scala2.11.tgz package, but this is a
>> separate thing. (1.5.0 also has a hadoop1-scala2.11 package.)
>>
>> Was this intentional?
>>
>> More importantly, is there some rough specification for what packages
>> we should be able to expect in this S3 bucket with every release?
>>
>> This is important for those of us who depend on this publishing venue
>> (e.g. spark-ec2 and related tools).
>>
>> Nick
>> ​
>>
>
>

>>


HiveContext in standalone mode: shuffle hang ups

2015-10-05 Thread Saif.A.Ellafi
Hi all,

I have a process that takes only 40 seconds in local mode. The same process in
stand-alone mode, with the node used for local mode as the only available
node, takes forever: RDD actions hang.

I could only "sort this out" by turning speculation on, so the hanging task is
retried and then it works.

Sadly, this does not help on huge tasks.

How could I diagnose this further?

Thanks,
Saif
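
A minimal sketch of the speculation workaround described above, assuming a
Spark 1.5.x application; the values shown are roughly the documented defaults
and are only illustrative, and the app name is hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("hang-prone-job")                  // hypothetical app name
  .set("spark.speculation", "true")              // re-launch straggling tasks speculatively
  .set("spark.speculation.interval", "100ms")    // how often to check for stragglers
  .set("spark.speculation.multiplier", "1.5")    // how much slower than the median counts as slow
  .set("spark.speculation.quantile", "0.75")     // fraction of tasks that must finish before checking
val sc = new SparkContext(conf)

Speculation only masks the symptom; the stage detail page and executor thread
dumps in the Spark UI are usually the quickest way to see where a hanging
shuffle task is actually stuck.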



Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-05 Thread Blaž Šnuderl
Also missing is
http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz,
which breaks the spark-ec2 script.

On Mon, Oct 5, 2015 at 5:20 AM, Ted Yu  wrote:

> hadoop1 package for Scala 2.10 wasn't in RC1 either:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>
> On Sun, Oct 4, 2015 at 5:17 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I’m looking here:
>>
>> https://s3.amazonaws.com/spark-related-packages/
>>
>> I believe this is where one set of official packages is published. Please
>> correct me if this is not the case.
>>
>> It appears that almost every version of Spark up to and including 1.5.0
>> has included a --bin-hadoop1.tgz release (e.g.
>> spark-1.5.0-bin-hadoop1.tgz).
>>
>> However, 1.5.1 has no such package. There is a
>> spark-1.5.1-bin-hadoop1-scala2.11.tgz package, but this is a separate
>> thing. (1.5.0 also has a hadoop1-scala2.11 package.)
>>
>> Was this intentional?
>>
>> More importantly, is there some rough specification for what packages we
>> should be able to expect in this S3 bucket with every release?
>>
>> This is important for those of us who depend on this publishing venue
>> (e.g. spark-ec2 and related tools).
>>
>> Nick
>> ​
>>
>
>


Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-05 Thread Josh Rosen
I'm working on a fix for this right now. I'm planning to re-run a modified
copy of the release packaging scripts which will emit only the missing
artifacts (so we won't upload new artifacts with different SHAs for the
builds which *did* succeed).

I expect to have this finished in the next day or so; I'm currently blocked
by some infra downtime but expect that to be resolved soon.

- Josh

On Mon, Oct 5, 2015 at 8:46 AM, Nicholas Chammas  wrote:

> Blaž said:
>
> Also missing is
> http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
> which breaks spark-ec2 script.
>
> This is the package I am referring to in my original email.
>
> Nick said:
>
> It appears that almost every version of Spark up to and including 1.5.0
> has included a —bin-hadoop1.tgz release (e.g. spark-1.5.0-bin-hadoop1.tgz).
> However, 1.5.1 has no such package.
>
> Nick
> ​
>
> On Mon, Oct 5, 2015 at 3:27 AM Blaž Šnuderl  wrote:
>
>> Also missing is http://s3.amazonaws.com/spark-related-packages/spark-
>> 1.5.1-bin-hadoop1.tgz which breaks spark-ec2 script.
>>
>> On Mon, Oct 5, 2015 at 5:20 AM, Ted Yu  wrote:
>>
>>> hadoop1 package for Scala 2.10 wasn't in RC1 either:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>>>
>>> On Sun, Oct 4, 2015 at 5:17 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 I’m looking here:

 https://s3.amazonaws.com/spark-related-packages/

 I believe this is where one set of official packages is published.
 Please correct me if this is not the case.

 It appears that almost every version of Spark up to and including 1.5.0
 has included a --bin-hadoop1.tgz release (e.g.
 spark-1.5.0-bin-hadoop1.tgz).

 However, 1.5.1 has no such package. There is a
 spark-1.5.1-bin-hadoop1-scala2.11.tgz package, but this is a separate
 thing. (1.5.0 also has a hadoop1-scala2.11 package.)

 Was this intentional?

 More importantly, is there some rough specification for what packages
 we should be able to expect in this S3 bucket with every release?

 This is important for those of us who depend on this publishing venue
 (e.g. spark-ec2 and related tools).

 Nick
 ​

>>>
>>>
>>


Re: failure notice

2015-10-05 Thread Renyi Xiong
If RDDs from the same DStream are not guaranteed to run on the same worker,
then the question becomes:

is it possible to specify an unlimited duration in the StreamingContext so as
to have a continuous stream (as opposed to a discretized one)?

Say we have a per-node streaming engine (with built-in checkpointing and
recovery) that we'd like to integrate with Spark Streaming. Can we have a
never-ending batch (or RDD) this way?

On Mon, Sep 28, 2015 at 4:31 PM,  wrote:

> Hi. This is the qmail-send program at apache.org.
> I'm afraid I wasn't able to deliver your message to the following
> addresses.
> This is a permanent error; I've given up. Sorry it didn't work out.
>
> :
> Must be sent from an @apache.org address or a subscriber address or an
> address in LDAP.
>

Re: failure notice

2015-10-05 Thread Tathagata Das
What happens when a whole node running your "per-node streaming engine
(built-in checkpoint and recovery)" fails? Can its checkpoint and recovery
mechanism handle whole-node failure? Can you recover from the checkpoint on
a different node?

Spark and Spark Streaming were designed with the idea that executors are
disposable, and there should not be any node-specific long-term state that
you rely on unless you can recover that state on a different node.
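
A minimal sketch of the driver-side pattern that makes that kind of recovery
possible in Spark Streaming (checkpointing to a fault-tolerant filesystem and
rebuilding the context from it on restart); the checkpoint path and app name
are illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("recoverable-app")
  val ssc = new StreamingContext(conf, Seconds(1))      // the batch interval is a finite Duration
  ssc.checkpoint("hdfs:///checkpoints/recoverable-app") // state visible from any node
  // ... define DStreams and output operations here ...
  ssc
}

// On (re)start, recover from the checkpoint if one exists, so nothing depends
// on the local state of the node that previously ran the driver or executors.
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/recoverable-app", createContext _)
ssc.start()
ssc.awaitTermination()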

On Mon, Oct 5, 2015 at 3:03 PM, Renyi Xiong  wrote:

> if RDDs from same DStream not guaranteed to run on same worker, then the
> question becomes:
>
> is it possible to specify an unlimited duration in ssc to have a
> continuous stream (as opposed to discretized).
>
> say, we have a per node streaming engine (built-in checkpoint and
> recovery) we'd like to integrate with spark streaming. can we have a
> never-ending batch (or RDD) this way?
>
> On Mon, Sep 28, 2015 at 4:31 PM,  wrote:
>
>> Hi. This is the qmail-send program at apache.org.
>> I'm afraid I wasn't able to deliver your message to the following
>> addresses.
>> This is a permanent error; I've given up. Sorry it didn't work out.
>>
>> :
>> Must be sent from an @apache.org address or a subscriber address or an
>> address in LDAP.
>>

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-05 Thread Nicholas Chammas
Thanks for looking into this Josh.

On Mon, Oct 5, 2015 at 5:39 PM Josh Rosen  wrote:

> I'm working on a fix for this right now. I'm planning to re-run a modified
> copy of the release packaging scripts which will emit only the missing
> artifacts (so we won't upload new artifacts with different SHAs for the
> builds which *did* succeed).
>
> I expect to have this finished in the next day or so; I'm currently
> blocked by some infra downtime but expect that to be resolved soon.
>
> - Josh
>
> On Mon, Oct 5, 2015 at 8:46 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Blaž said:
>>
>> Also missing is
>> http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
>> which breaks spark-ec2 script.
>>
>> This is the package I am referring to in my original email.
>>
>> Nick said:
>>
>> It appears that almost every version of Spark up to and including 1.5.0
>> has included a —bin-hadoop1.tgz release (e.g. spark-1.5.0-bin-hadoop1.tgz).
>> However, 1.5.1 has no such package.
>>
>> Nick
>> ​
>>
>> On Mon, Oct 5, 2015 at 3:27 AM Blaž Šnuderl  wrote:
>>
>>> Also missing is http://s3.amazonaws.com/spark-related-packages/spark-
>>> 1.5.1-bin-hadoop1.tgz which breaks spark-ec2 script.
>>>
>>> On Mon, Oct 5, 2015 at 5:20 AM, Ted Yu  wrote:
>>>
 hadoop1 package for Scala 2.10 wasn't in RC1 either:
 http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/

 On Sun, Oct 4, 2015 at 5:17 PM, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> I’m looking here:
>
> https://s3.amazonaws.com/spark-related-packages/
>
> I believe this is where one set of official packages is published.
> Please correct me if this is not the case.
>
> It appears that almost every version of Spark up to and including
> 1.5.0 has included a --bin-hadoop1.tgz release (e.g.
> spark-1.5.0-bin-hadoop1.tgz).
>
> However, 1.5.1 has no such package. There is a
> spark-1.5.1-bin-hadoop1-scala2.11.tgz package, but this is a separate
> thing. (1.5.0 also has a hadoop1-scala2.11 package.)
>
> Was this intentional?
>
> More importantly, is there some rough specification for what packages
> we should be able to expect in this S3 bucket with every release?
>
> This is important for those of us who depend on this publishing venue
> (e.g. spark-ec2 and related tools).
>
> Nick
> ​
>


>>>
>


Re: IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-05 Thread Reynold Xin
You can write the data to local hdfs (or local disk) and just load it from
there.


On Mon, Oct 5, 2015 at 4:37 PM, Jegan  wrote:

> Thanks for your suggestion Ted.
>
> Unfortunately at this point of time I cannot go beyond 1000 partitions. I
> am writing this data to BigQuery and it has a limit of 1000 jobs per day
> for a table(they have some limits on this)  I currently create 1 load job
> per partition. Is there any other work-around?
>
> Thanks again.
>
> Regards,
> Jegan
>
> On Mon, Oct 5, 2015 at 3:53 PM, Ted Yu  wrote:
>
>> As a workaround, can you set the number of partitions higher in the
>> sc.textFile method ?
>>
>> Cheers
>>
>> On Mon, Oct 5, 2015 at 3:31 PM, Jegan  wrote:
>>
>>> Hi All,
>>>
>>> I am facing the below exception when the size of the file being read in
>>> a partition is above 2GB. This is apparently because Java's limitation on
>>> memory mapped files. It supports mapping only 2GB files.
>>>
>>> Caused by: java.lang.IllegalArgumentException: Size exceeds
>>> Integer.MAX_VALUE
>>> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
>>> at
>>> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
>>> at
>>> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
>>> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1207)
>>> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127)
>>> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)
>>> at
>>> org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:102)
>>> at
>>> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:791)
>>> at
>>> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
>>> at
>>> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:153)
>>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>>> at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> My use case is to read the files from S3 and do some processing. I am
>>> caching the data like below in order to avoid SocketTimeoutExceptions from
>>> another library I am using for the processing.
>>>
>>> val rdd1 = sc.textFile("***").coalesce(1000)
>>> rdd1.persist(DISK_ONLY_2) // replication factor 2
>>> rdd1.foreachPartition { iter => } // one pass over the data to download
>>>
>>> The 3rd line fails with the above error when a partition contains a file
>>> of size more than 2GB file.
>>>
>>> Do you think this needs to be fixed in Spark? One idea may be is to use
>>> a wrapper class (something called BigByteBuffer) which keeps an array of
>>> ByteBuffers and keeps the index of the current buffer being read etc. Below
>>> is the modified DiskStore.scala.
>>>
>>> private def getBytes(file: File, offset: Long, length: Long): 
>>> Option[ByteBuffer] = {
>>>   val channel = new RandomAccessFile(file, "r").getChannel
>>>   Utils.tryWithSafeFinally {
>>> // For small files, directly read rather than memory map
>>> if (length < minMemoryMapBytes) {
>>>   // Map small file in Memory
>>> } else {
>>>   // TODO Create a BigByteBuffer
>>>
>>> }
>>>   } {
>>> channel.close()
>>>   }
>>> }
>>>
>>> class BigByteBuffer extends ByteBuffer {
>>>   val buffers: Array[ByteBuffer]
>>>   var currentIndex = 0
>>>
>>>   ... // Other methods
>>> }
>>>
>>> Please let me know if there is any other work-around for the same. Thanks 
>>> for your time.
>>>
>>> Regards,
>>> Jegan
>>>
>>
>>
>


Re: IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-05 Thread Jegan
I am sorry, I didn't understand that completely. Are you suggesting copying
the files from S3 to HDFS? Actually, that is what I am doing: I am reading
the files using Spark and persisting them locally.

Or did you actually mean asking the producer to write the files directly to
HDFS instead of S3? I am not sure I can do this now either.

Please clarify if I misunderstood what you meant.

Thanks,
Jegan

On Mon, Oct 5, 2015 at 4:42 PM, Reynold Xin  wrote:

> You can write the data to local hdfs (or local disk) and just load it from
> there.
>
>
> On Mon, Oct 5, 2015 at 4:37 PM, Jegan  wrote:
>
>> Thanks for your suggestion Ted.
>>
>> Unfortunately at this point of time I cannot go beyond 1000 partitions. I
>> am writing this data to BigQuery and it has a limit of 1000 jobs per day
>> for a table(they have some limits on this)  I currently create 1 load job
>> per partition. Is there any other work-around?
>>
>> Thanks again.
>>
>> Regards,
>> Jegan
>>
>> On Mon, Oct 5, 2015 at 3:53 PM, Ted Yu  wrote:
>>
>>> As a workaround, can you set the number of partitions higher in the
>>> sc.textFile method ?
>>>
>>> Cheers
>>>
>>> On Mon, Oct 5, 2015 at 3:31 PM, Jegan  wrote:
>>>
 Hi All,

 I am facing the below exception when the size of the file being read in
 a partition is above 2GB. This is apparently because Java's limitation on
 memory mapped files. It supports mapping only 2GB files.

 Caused by: java.lang.IllegalArgumentException: Size exceeds
 Integer.MAX_VALUE
 at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
 at
 org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
 at
 org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1207)
 at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127)
 at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)
 at
 org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:102)
 at
 org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:791)
 at
 org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
 at
 org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:153)
 at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
 at org.apache.spark.scheduler.Task.run(Task.scala:88)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)

 My use case is to read the files from S3 and do some processing. I am
 caching the data like below in order to avoid SocketTimeoutExceptions from
 another library I am using for the processing.

 val rdd1 = sc.textFile("***").coalesce(1000)
 rdd1.persist(DISK_ONLY_2) // replication factor 2
 rdd1.foreachPartition { iter => } // one pass over the data to download

 The 3rd line fails with the above error when a partition contains a
 file of size more than 2GB file.

 Do you think this needs to be fixed in Spark? One idea may be is to use
 a wrapper class (something called BigByteBuffer) which keeps an array of
 ByteBuffers and keeps the index of the current buffer being read etc. Below
 is the modified DiskStore.scala.

 private def getBytes(file: File, offset: Long, length: Long): 
 Option[ByteBuffer] = {
   val channel = new RandomAccessFile(file, "r").getChannel
   Utils.tryWithSafeFinally {
 // For small files, directly read rather than memory map
 if (length < minMemoryMapBytes) {
   // Map small file in Memory
 } else {
   // TODO Create a BigByteBuffer

 }
   } {
 channel.close()
   }
 }

 class BigByteBuffer extends ByteBuffer {
   val buffers: Array[ByteBuffer]
   var currentIndex = 0

   ... // Other methods
 }

 Please let me know if there is any other 

Re: spark hive branch location

2015-10-05 Thread weoccc
Hi Michael,

Thanks for pointing me to the branch. What are the build instructions for the
Hive 1.2.1 release branch used by Spark 1.5?

Weide

On Mon, Oct 5, 2015 at 12:06 PM, Michael Armbrust 
wrote:

> I think this is the most up to date branch (used in Spark 1.5):
> https://github.com/pwendell/hive/tree/release-1.2.1-spark
>
> On Mon, Oct 5, 2015 at 1:03 PM, weoccc  wrote:
>
>> Hi,
>>
>> I would like to know where is the spark hive github location where spark
>> build depend on ? I was told it used to be here
>> https://github.com/pwendell/hive but it seems it is no longer there.
>>
>> Thanks a lot,
>>
>> Weide
>>
>
>


Re: IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-05 Thread Reynold Xin
I meant to say just copy everything to a local hdfs, and then don't use
caching ...


On Mon, Oct 5, 2015 at 4:52 PM, Jegan  wrote:

> I am sorry, I didn't understand it completely. Are you suggesting to copy
> the files from S3 to HDFS? Actually, that is what I am doing. I am reading
> the files using Spark and persisting it locally.
>
> Or did you actually mean to ask the producer to write the files directly
> to HDFS instead of S3? I am not sure I can do this now either.
>
> Please clarify me if I misunderstood what you meant.
>
> Thanks,
> Jegan
>
> On Mon, Oct 5, 2015 at 4:42 PM, Reynold Xin  wrote:
>
>> You can write the data to local hdfs (or local disk) and just load it
>> from there.
>>
>>
>> On Mon, Oct 5, 2015 at 4:37 PM, Jegan  wrote:
>>
>>> Thanks for your suggestion Ted.
>>>
>>> Unfortunately at this point of time I cannot go beyond 1000 partitions.
>>> I am writing this data to BigQuery and it has a limit of 1000 jobs per day
>>> for a table(they have some limits on this)  I currently create 1 load job
>>> per partition. Is there any other work-around?
>>>
>>> Thanks again.
>>>
>>> Regards,
>>> Jegan
>>>
>>> On Mon, Oct 5, 2015 at 3:53 PM, Ted Yu  wrote:
>>>
 As a workaround, can you set the number of partitions higher in the
 sc.textFile method ?

 Cheers

 On Mon, Oct 5, 2015 at 3:31 PM, Jegan  wrote:

> Hi All,
>
> I am facing the below exception when the size of the file being read
> in a partition is above 2GB. This is apparently because Java's limitation
> on memory mapped files. It supports mapping only 2GB files.
>
> Caused by: java.lang.IllegalArgumentException: Size exceeds
> Integer.MAX_VALUE
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
> at
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
> at
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
> at
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1207)
> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127)
> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)
> at
> org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:102)
> at
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:791)
> at
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
> at
> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:153)
> at
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> My use case is to read the files from S3 and do some processing. I am
> caching the data like below in order to avoid SocketTimeoutExceptions from
> another library I am using for the processing.
>
> val rdd1 = sc.textFile("***").coalesce(1000)
> rdd1.persist(DISK_ONLY_2) // replication factor 2
> rdd1.foreachPartition { iter => } // one pass over the data to download
>
> The 3rd line fails with the above error when a partition contains a
> file of size more than 2GB file.
>
> Do you think this needs to be fixed in Spark? One idea may be is to
> use a wrapper class (something called BigByteBuffer) which keeps an array
> of ByteBuffers and keeps the index of the current buffer being read etc.
> Below is the modified DiskStore.scala.
>
> private def getBytes(file: File, offset: Long, length: Long): 
> Option[ByteBuffer] = {
>   val channel = new RandomAccessFile(file, "r").getChannel
>   Utils.tryWithSafeFinally {
> // For small files, directly read rather than memory map
> if (length < minMemoryMapBytes) {
>   // Map small file in Memory
> } else {
>   // TODO Create a 

Re: IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-05 Thread Ted Yu
As a workaround, can you set the number of partitions higher in the
sc.textFile method?

Cheers

On Mon, Oct 5, 2015 at 3:31 PM, Jegan  wrote:

> Hi All,
>
> I am facing the below exception when the size of the file being read in a
> partition is above 2GB. This is apparently because Java's limitation on
> memory mapped files. It supports mapping only 2GB files.
>
> Caused by: java.lang.IllegalArgumentException: Size exceeds
> Integer.MAX_VALUE
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
> at
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
> at
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1207)
> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127)
> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)
> at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:102)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:791)
> at
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
> at
> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:153)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> My use case is to read the files from S3 and do some processing. I am
> caching the data like below in order to avoid SocketTimeoutExceptions from
> another library I am using for the processing.
>
> val rdd1 = sc.textFile("***").coalesce(1000)
> rdd1.persist(DISK_ONLY_2) // replication factor 2
> rdd1.foreachPartition { iter => } // one pass over the data to download
>
> The 3rd line fails with the above error when a partition contains a file
> of size more than 2GB file.
>
> Do you think this needs to be fixed in Spark? One idea may be is to use a
> wrapper class (something called BigByteBuffer) which keeps an array of
> ByteBuffers and keeps the index of the current buffer being read etc. Below
> is the modified DiskStore.scala.
>
> private def getBytes(file: File, offset: Long, length: Long): 
> Option[ByteBuffer] = {
>   val channel = new RandomAccessFile(file, "r").getChannel
>   Utils.tryWithSafeFinally {
> // For small files, directly read rather than memory map
> if (length < minMemoryMapBytes) {
>   // Map small file in Memory
> } else {
>   // TODO Create a BigByteBuffer
>
> }
>   } {
> channel.close()
>   }
> }
>
> class BigByteBuffer extends ByteBuffer {
>   val buffers: Array[ByteBuffer]
>   var currentIndex = 0
>
>   ... // Other methods
> }
>
> Please let me know if there is any other work-around for the same. Thanks for 
> your time.
>
> Regards,
> Jegan
>
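
A sketch of the partition-count workaround Ted describes, with illustrative
paths and counts (the second argument to textFile is minPartitions):

val rdd1 = sc.textFile("s3n://bucket/path", 5000)

// Alternatively, split the data after reading instead of coalescing down to
// 1000 partitions before caching, so no single cached block exceeds ~2 GB:
val rdd2 = sc.textFile("s3n://bucket/path").repartition(5000)

Since the BigQuery limit applies to load jobs rather than to cached blocks, one
option may be to persist with the higher partition count and only coalesce back
to 1000 partitions in the final stage that performs the load.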


Re: StructType has more rows, than corresponding Row has objects.

2015-10-05 Thread Davies Liu
Could you tell us a way to reproduce this failure? Reading from JSON or Parquet?

On Mon, Oct 5, 2015 at 4:28 AM, Eugene Morozov
 wrote:
> Hi,
>
> We're building our own framework on top of spark and we give users pretty
> complex schema to work with. That requires from us to build dataframes by
> ourselves: we transform business objects to rows and struct types and uses
> these two to create dataframe.
>
> Everything was fine until I started to upgrade to spark 1.5.0 (from 1.3.1).
> Seems to be catalyst engine has been changed and now using almost the same
> code to produce rows and struct types I have the following:
> http://ibin.co/2HzUsoe9O96l, some of rows in the end result have different
> number of values and corresponding struct types.
>
> I'm almost sure it's my own fault, but there is always a small chance, that
> something is wrong in spark codebase. If you've seen something similar or if
> there is a jira for smth similar, I'd be glad to know. Thanks.
> --
> Be well!
> Jean Morozov




Re: Dataframes: PrunedFilteredScan without Spark Side Filtering

2015-10-05 Thread Russell Spitzer
That sounds fine to me; we already do the filtering, so populating that
field would be pretty simple.

On Sun, Sep 27, 2015 at 2:08 PM Michael Armbrust 
wrote:

> We have to try and maintain binary compatibility here, so probably the
> easiest thing to do here would be to add a method to the class.  Perhaps
> something like:
>
> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
>
> By default, this could return all filters so behavior would remain the
> same, but specific implementations could override it.  There is still a
> chance that this would conflict with existing methods, but hopefully that
> would not be a problem in practice.
>
> Thoughts?
>
> Michael
>
> On Fri, Sep 25, 2015 at 10:02 PM, Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> Hi! First time poster, long time reader.
>>
>> I'm wondering if there is a way to let Catalyst know that it doesn't need
>> to repeat a filter on the Spark side after a filter has been applied by the
>> source implementing PrunedFilteredScan.
>>
>>
>> This is for a use case in which we accept a filter on a non-existent
>> column that serves as an entry point for our integration with a different
>> system. While the source can correctly deal with this, the secondary filter
>> done on the RDD itself wipes out the results because the column being
>> filtered does not exist.
>>
>> In particular this is with our integration with Solr, where we allow users
>> to pass in a predicate based on "solr_query" (e.g. where solr_query='*:*').
>> There is no column "solr_query", so the rdd.filter(row.solr_query == '*:*')
>> filters out all of the data since no rows will have that column.
>>
>> I'm thinking about a few solutions to this but they all seem a little
>> hacky
>> 1) Try to manually remove the filter step from the query plan after our
>> source handles the filter
>> 2) Populate the solr_query field being returned so they all automatically
>> pass
>>
>> But I think the real solution is to add a way to create a
>> PrunedFilteredScan which does not reapply filters if the source doesn't want
>> it to, i.e. giving PrunedFilteredScan the ability to trust the underlying
>> source that the filter will be accurately applied. Maybe changing the API
>> to
>>
>> PrunedFilteredScan(requiredColumns: Array[String], filters: Array[Filter],
>> reapply: Boolean = true)
>>
>> Where Catalyst can check the Reapply value and not add an RDD.filter if
>> it is false.
>>
>> Thoughts?
>>
>> Thanks for your time,
>> Russ
>>
>
>
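
A sketch of what the proposed hook could look like from a connector's point of
view, using the Solr pseudo-column as the example; the class and its body are
illustrative stubs, and unhandledFilters is the proposed addition, not an
existing method in Spark 1.5:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

class SolrRelation(override val sqlContext: SQLContext, override val schema: StructType)
  extends BaseRelation with PrunedFilteredScan {

  // Proposed hook: return only the filters Catalyst still needs to re-apply.
  // The solr_query pseudo-column is handled entirely by the source, so it is
  // dropped here and no Spark-side rdd.filter is generated for it.
  def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot {
      case EqualTo("solr_query", _) => true
      case _ => false
    }

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // Push solr_query (and any other supported predicates) down to Solr here.
    ???
  }
}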


Re: IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-05 Thread Jegan
Thanks for your suggestion Ted.

Unfortunately, at this point I cannot go beyond 1000 partitions. I am
writing this data to BigQuery, which has a limit of 1000 load jobs per day
per table (they have some limits on this), and I currently create one load
job per partition. Is there any other work-around?

Thanks again.

Regards,
Jegan

On Mon, Oct 5, 2015 at 3:53 PM, Ted Yu  wrote:

> As a workaround, can you set the number of partitions higher in the
> sc.textFile method ?
>
> Cheers
>
> On Mon, Oct 5, 2015 at 3:31 PM, Jegan  wrote:
>
>> Hi All,
>>
>> I am facing the below exception when the size of the file being read in a
>> partition is above 2GB. This is apparently because Java's limitation on
>> memory mapped files. It supports mapping only 2GB files.
>>
>> Caused by: java.lang.IllegalArgumentException: Size exceeds
>> Integer.MAX_VALUE
>> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
>> at
>> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
>> at
>> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
>> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1207)
>> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127)
>> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)
>> at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:102)
>> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:791)
>> at
>> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
>> at
>> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:153)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> My use case is to read the files from S3 and do some processing. I am
>> caching the data like below in order to avoid SocketTimeoutExceptions from
>> another library I am using for the processing.
>>
>> val rdd1 = sc.textFile("***").coalesce(1000)
>> rdd1.persist(DISK_ONLY_2) // replication factor 2
>> rdd1.foreachPartition { iter => } // one pass over the data to download
>>
>> The 3rd line fails with the above error when a partition contains a file
>> of size more than 2GB file.
>>
>> Do you think this needs to be fixed in Spark? One idea may be is to use a
>> wrapper class (something called BigByteBuffer) which keeps an array of
>> ByteBuffers and keeps the index of the current buffer being read etc. Below
>> is the modified DiskStore.scala.
>>
>> private def getBytes(file: File, offset: Long, length: Long): 
>> Option[ByteBuffer] = {
>>   val channel = new RandomAccessFile(file, "r").getChannel
>>   Utils.tryWithSafeFinally {
>> // For small files, directly read rather than memory map
>> if (length < minMemoryMapBytes) {
>>   // Map small file in Memory
>> } else {
>>   // TODO Create a BigByteBuffer
>>
>> }
>>   } {
>> channel.close()
>>   }
>> }
>>
>> class BigByteBuffer extends ByteBuffer {
>>   val buffers: Array[ByteBuffer]
>>   var currentIndex = 0
>>
>>   ... // Other methods
>> }
>>
>> Please let me know if there is any other work-around for the same. Thanks 
>> for your time.
>>
>> Regards,
>> Jegan
>>
>
>


Re: Difference between a task and a job

2015-10-05 Thread Daniel Darabos
Actions trigger jobs. A job is made up of stages. A stage is made up of
tasks. Executor threads execute tasks.

Does that answer your question?

On Mon, Oct 5, 2015 at 12:52 PM, Guna Prasaad  wrote:

> What is the difference between a task and a job in spark and
> spark-streaming?
>
> Regards,
> Guna
>
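
A tiny concrete example of how those pieces line up (partition counts chosen
arbitrarily):

val rdd = sc.parallelize(1 to 1000000, 8)                       // 8 partitions
val counts = rdd.map(x => (x % 10, 1L)).reduceByKey(_ + _, 8)   // shuffle => new stage, 8 output partitions
counts.count()  // the action submits one job: two stages, 8 tasks each

In Spark Streaming the same model applies per batch: each output operation that
runs an action on the batch's RDDs submits a job for that batch interval.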


StructType has more rows, than corresponding Row has objects.

2015-10-05 Thread Eugene Morozov
Hi,

We're building our own framework on top of Spark, and we give users a pretty
complex schema to work with. That requires us to build dataframes ourselves:
we transform business objects into rows and struct types and use these two to
create a dataframe.

Everything was fine until I started to upgrade to Spark 1.5.0 (from 1.3.1).
The Catalyst engine seems to have changed, and now, using almost the same
code to produce rows and struct types, I see the following:
http://ibin.co/2HzUsoe9O96l: some rows in the end result have a different
number of values than their corresponding struct types.

I'm almost sure it's my own fault, but there is always a small chance that
something is wrong in the Spark codebase. If you've seen something similar,
or if there is a JIRA for something similar, I'd be glad to know. Thanks.
--
Be well!
Jean Morozov
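
For comparison, a minimal sketch of building a DataFrame from hand-made rows
and a hand-made StructType (field names are illustrative); the invariant worth
re-checking after the upgrade is that every Row carries exactly one value per
StructField, in the same order as the schema:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))

// Each Row has exactly two values, matching the two fields above.
val rows = sc.parallelize(Seq(Row(1, "alice"), Row(2, "bob")))
val df = sqlContext.createDataFrame(rows, schema)
df.show()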


Difference between a task and a job

2015-10-05 Thread Guna Prasaad
What is the difference between a task and a job in spark and
spark-streaming?

Regards,
Guna


Re: Dataframe nested schema inference from Json without type conflicts

2015-10-05 Thread Yin Huai
Hello Ewan,

Adding a JSON-specific option makes sense. Can you open a JIRA for this?
Also, sending out a PR will be great. For JSONRelation, I think we can pass
all user-specific options to it (see
org.apache.spark.sql.execution.datasources.json.DefaultSource's
createRelation) just like what we do for ParquetRelation. Then, inside
JSONRelation, we figure out what kinds of options have been specified.

Thanks,

Yin

On Mon, Oct 5, 2015 at 9:04 AM, Ewan Leith 
wrote:

> I’ve done some digging today and, as a quick and ugly fix, altering the
> case statement of the JSON inferField function in InferSchema.scala
>
>
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala
>
>
>
> to have
>
>
>
> case VALUE_STRING | VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT | VALUE_TRUE |
> VALUE_FALSE => StringType
>
>
>
> rather than the rules for each type works as we’d want.
>
>
>
> If we were to wrap this up in a configuration setting in JSONRelation like
> the samplingRatio setting, with the default being to behave as it currently
> works, does anyone think a pull request would plausibly get into the Spark
> main codebase?
>
>
>
> Thanks,
>
> Ewan
>
>
>
>
>
>
>
> *From:* Ewan Leith [mailto:ewan.le...@realitymine.com]
> *Sent:* 02 October 2015 01:57
> *To:* yh...@databricks.com
>
> *Cc:* r...@databricks.com; dev@spark.apache.org
> *Subject:* Re: Dataframe nested schema inference from Json without type
> conflicts
>
>
>
> Exactly, that's a much better way to put it.
>
>
>
> Thanks,
>
> Ewan
>
>
>
> -- Original message--
>
> *From: *Yin Huai
>
> *Date: *Thu, 1 Oct 2015 23:54
>
> *To: *Ewan Leith;
>
> *Cc: *r...@databricks.com;dev@spark.apache.org;
>
> *Subject:*Re: Dataframe nested schema inference from Json without type
> conflicts
>
>
>
> Hi Ewan,
>
>
>
> For your use case, you only need the schema inference to pick up the
> structure of your data (basically you want spark sql to infer the type of
> complex values like arrays and structs but keep the type of primitive
> values as strings), right?
>
>
>
> Thanks,
>
>
>
> Yin
>
>
>
> On Thu, Oct 1, 2015 at 2:27 PM, Ewan Leith 
> wrote:
>
> We could, but if a client sends some unexpected records in the schema
> (which happens more than I'd like, our schema seems to constantly evolve),
> its fantastic how Spark picks up on that data and includes it.
>
>
>
> Passing in a fixed schema loses that nice additional ability, though it's
> what we'll probably have to adopt if we can't come up with a way to keep
> the inference working.
>
>
>
> Thanks,
>
> Ewan
>
>
>
> -- Original message--
>
> *From: *Reynold Xin
>
> *Date: *Thu, 1 Oct 2015 22:12
>
> *To: *Ewan Leith;
>
> *Cc: *dev@spark.apache.org;
>
> *Subject:*Re: Dataframe nested schema inference from Json without type
> conflicts
>
>
>
> You can pass the schema into json directly, can't you?
>
>
>
> On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith 
> wrote:
>
> Hi all,
>
>
>
> We really like the ability to infer a schema from JSON contained in an
> RDD, but when we’re using Spark Streaming on small batches of data, we
> sometimes find that Spark infers a more specific type than it should use,
> for example if the json in that small batch only contains integer values
> for a String field, it’ll class the field as an Integer type on one
> Streaming batch, then a String on the next one.
>
>
>
> Instead, we’d rather match every value as a String type, then handle any
> casting to a desired type later in the process.
>
>
>
> I don’t think there’s currently any simple way to avoid this that I can
> see, but we could add the functionality in the JacksonParser.scala file,
> probably in convertField.
>
>
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala
>
>
>
> Does anyone know an easier and cleaner way to do this?
>
>
>
> Thanks,
>
> Ewan
>
>
>
>
>
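
From the user's side, the JSON-specific option being discussed could end up
looking roughly like this; the option name below is only a placeholder until a
JIRA/PR settles on the real one, and the path is illustrative:

val df = sqlContext.read
  .option("primitivesAsString", "true")  // placeholder: "infer structure, keep primitives as strings"
  .json("hdfs:///path/to/events")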


Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-05 Thread Nicholas Chammas
Blaž said:

Also missing is
http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
which breaks spark-ec2 script.

This is the package I am referring to in my original email.

Nick said:

It appears that almost every version of Spark up to and including 1.5.0 has
included a —bin-hadoop1.tgz release (e.g. spark-1.5.0-bin-hadoop1.tgz).
However, 1.5.1 has no such package.

Nick
​

On Mon, Oct 5, 2015 at 3:27 AM Blaž Šnuderl  wrote:

> Also missing is 
> http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
> which breaks spark-ec2 script.
>
> On Mon, Oct 5, 2015 at 5:20 AM, Ted Yu  wrote:
>
>> hadoop1 package for Scala 2.10 wasn't in RC1 either:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>>
>> On Sun, Oct 4, 2015 at 5:17 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I’m looking here:
>>>
>>> https://s3.amazonaws.com/spark-related-packages/
>>>
>>> I believe this is where one set of official packages is published.
>>> Please correct me if this is not the case.
>>>
>>> It appears that almost every version of Spark up to and including 1.5.0
>>> has included a --bin-hadoop1.tgz release (e.g.
>>> spark-1.5.0-bin-hadoop1.tgz).
>>>
>>> However, 1.5.1 has no such package. There is a
>>> spark-1.5.1-bin-hadoop1-scala2.11.tgz package, but this is a separate
>>> thing. (1.5.0 also has a hadoop1-scala2.11 package.)
>>>
>>> Was this intentional?
>>>
>>> More importantly, is there some rough specification for what packages we
>>> should be able to expect in this S3 bucket with every release?
>>>
>>> This is important for those of us who depend on this publishing venue
>>> (e.g. spark-ec2 and related tools).
>>>
>>> Nick
>>> ​
>>>
>>
>>
>


Re: Dataframe nested schema inference from Json without type conflicts

2015-10-05 Thread Ewan Leith
Thanks Yin, I'll put together a JIRA and a PR tomorrow.


Ewan


-- Original message--

From: Yin Huai

Date: Mon, 5 Oct 2015 17:39

To: Ewan Leith;

Cc: dev@spark.apache.org;

Subject:Re: Dataframe nested schema inference from Json without type conflicts


Hello Ewan,

Adding a JSON-specific option makes sense. Can you open a JIRA for this? Also, 
sending out a PR will be great. For JSONRelation, I think we can pass all 
user-specific options to it (see 
org.apache.spark.sql.execution.datasources.json.DefaultSource's createRelation) 
just like what we do for ParquetRelation. Then, inside JSONRelation, we figure 
out what kind of options that have been specified.

Thanks,

Yin

On Mon, Oct 5, 2015 at 9:04 AM, Ewan Leith 
> wrote:
I've done some digging today and, as a quick and ugly fix, altering the case 
statement of the JSON inferField function in InferSchema.scala

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala

to have

case VALUE_STRING | VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT | VALUE_TRUE | 
VALUE_FALSE => StringType

rather than the rules for each type works as we'd want.

If we were to wrap this up in a configuration setting in JSONRelation like the 
samplingRatio setting, with the default being to behave as it currently works, 
does anyone think a pull request would plausibly get into the Spark main 
codebase?

Thanks,
Ewan



From: Ewan Leith 
[mailto:ewan.le...@realitymine.com]
Sent: 02 October 2015 01:57
To: yh...@databricks.com

Cc: r...@databricks.com; 
dev@spark.apache.org
Subject: Re: Dataframe nested schema inference from Json without type conflicts


Exactly, that's a much better way to put it.



Thanks,

Ewan



-- Original message--

From: Yin Huai

Date: Thu, 1 Oct 2015 23:54

To: Ewan Leith;

Cc: 
r...@databricks.com;dev@spark.apache.org;

Subject:Re: Dataframe nested schema inference from Json without type conflicts


Hi Ewan,

For your use case, you only need the schema inference to pick up the structure 
of your data (basically you want spark sql to infer the type of complex values 
like arrays and structs but keep the type of primitive values as strings), 
right?

Thanks,

Yin

On Thu, Oct 1, 2015 at 2:27 PM, Ewan Leith 
> wrote:

We could, but if a client sends some unexpected records in the schema (which 
happens more than I'd like, our schema seems to constantly evolve), its 
fantastic how Spark picks up on that data and includes it.



Passing in a fixed schema loses that nice additional ability, though it's what 
we'll probably have to adopt if we can't come up with a way to keep the 
inference working.



Thanks,

Ewan



-- Original message--

From: Reynold Xin

Date: Thu, 1 Oct 2015 22:12

To: Ewan Leith;

Cc: dev@spark.apache.org;

Subject:Re: Dataframe nested schema inference from Json without type conflicts


You can pass the schema into json directly, can't you?

On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith 
> wrote:
Hi all,

We really like the ability to infer a schema from JSON contained in an RDD, but 
when we're using Spark Streaming on small batches of data, we sometimes find 
that Spark infers a more specific type than it should use, for example if the 
json in that small batch only contains integer values for a String field, it'll 
class the field as an Integer type on one Streaming batch, then a String on the 
next one.

Instead, we'd rather match every value as a String type, then handle any 
casting to a desired type later in the process.

I don't think there's currently any simple way to avoid this that I can see, 
but we could add the functionality in the JacksonParser.scala file, probably in 
convertField.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala

Does anyone know an easier and cleaner way to do this?

Thanks,
Ewan





RE: Dataframe nested schema inference from Json without type conflicts

2015-10-05 Thread Ewan Leith
I've done some digging today. As a quick and ugly fix, altering the case 
statement of the JSON inferField function in InferSchema.scala

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala

so that it reads

case VALUE_STRING | VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT | VALUE_TRUE | 
VALUE_FALSE => StringType

rather than applying the separate rules for each type works as we'd want.

If we were to wrap this up in a configuration setting in JSONRelation like the 
samplingRatio setting, with the default being to behave as it currently works, 
does anyone think a pull request would plausibly get into the Spark main 
codebase?

Thanks,
Ewan



From: Ewan Leith [mailto:ewan.le...@realitymine.com]
Sent: 02 October 2015 01:57
To: yh...@databricks.com
Cc: r...@databricks.com; dev@spark.apache.org
Subject: Re: Dataframe nested schema inference from Json without type conflicts


Exactly, that's a much better way to put it.



Thanks,

Ewan



-- Original message--

From: Yin Huai

Date: Thu, 1 Oct 2015 23:54

To: Ewan Leith;

Cc: 
r...@databricks.com;dev@spark.apache.org;

Subject:Re: Dataframe nested schema inference from Json without type conflicts


Hi Ewan,

For your use case, you only need the schema inference to pick up the structure 
of your data (basically you want spark sql to infer the type of complex values 
like arrays and structs but keep the type of primitive values as strings), 
right?

Thanks,

Yin

On Thu, Oct 1, 2015 at 2:27 PM, Ewan Leith 
> wrote:

We could, but if a client sends some unexpected records in the schema (which 
happens more than I'd like, our schema seems to constantly evolve), its 
fantastic how Spark picks up on that data and includes it.



Passing in a fixed schema loses that nice additional ability, though it's what 
we'll probably have to adopt if we can't come up with a way to keep the 
inference working.



Thanks,

Ewan



-- Original message--

From: Reynold Xin

Date: Thu, 1 Oct 2015 22:12

To: Ewan Leith;

Cc: dev@spark.apache.org;

Subject:Re: Dataframe nested schema inference from Json without type conflicts


You can pass the schema into json directly, can't you?

On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith 
> wrote:
Hi all,

We really like the ability to infer a schema from JSON contained in an RDD, but 
when we're using Spark Streaming on small batches of data, we sometimes find 
that Spark infers a more specific type than it should use, for example if the 
json in that small batch only contains integer values for a String field, it'll 
class the field as an Integer type on one Streaming batch, then a String on the 
next one.

Instead, we'd rather match every value as a String type, then handle any 
casting to a desired type later in the process.

I don't think there's currently any simple way to avoid this that I can see, 
but we could add the functionality in the JacksonParser.scala file, probably in 
convertField.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala

Does anyone know an easier and cleaner way to do this?

Thanks,
Ewan
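
For reference, Reynold's suggestion of passing a schema into the JSON reader
directly looks roughly like this (field names are illustrative, and jsonRdd is
assumed to be an RDD[String] of JSON documents); it avoids per-batch inference
entirely, at the cost of the automatic pickup of new fields described above:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val fixedSchema = StructType(Seq(
  StructField("userId", StringType),
  StructField("value", StringType)))  // keep everything as strings, cast later as needed

val df = sqlContext.read.schema(fixedSchema).json(jsonRdd)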