Create multiple output files from Thriftserver

2016-04-28 Thread mayankshete
Is there a way to create multiple output files when connected from beeline to
the Thriftserver ?
Right now I am using beeline -e 'query' > output.txt, which is not efficient
as it relies on a Linux shell operator and combines everything into a single output file.
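
For illustration only (not from the thread; table name and paths are hypothetical): if the
query can be run as a Spark job against the same Hive metastore instead of through
beeline, the DataFrame writer produces one output file per partition, so no
shell-level splitting or combining is needed.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("multi-file-output"))
val sqlContext = new HiveContext(sc)

val result = sqlContext.sql("SELECT * FROM some_table WHERE dt = '2016-04-28'")
result.repartition(8)                      // 8 partitions -> 8 output files
      .write
      .format("json")                      // or "parquet", "com.databricks.spark.csv", ...
      .save("hdfs:///tmp/query_output")    // writes part-00000 ... part-00007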



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Create-multiple-output-files-from-Thriftserver-tp26845.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [2 BUG REPORTS] make-distribution.sh fails when an older maven version is installed on the system, and the VersionsSuite test hangs

2016-04-28 Thread Ted Yu
For #1, have you seen this JIRA ?

[SPARK-14867][BUILD] Remove `--force` option in `build/mvn`

On Thu, Apr 28, 2016 at 8:27 PM, Demon King  wrote:

> BUG 1:
> I have maven 3.0.2 installed on the system. When I use make-distribution.sh,
> it does not seem to use the bundled maven 3.2.2 but instead uses
> /usr/local/bin/mvn to build Spark. So I added the --force option in
> make-distribution.sh like this:
>
> line 130:
> VERSION=$("$MVN" --force help:evaluate -Dexpression=project.version $@
> 2>/dev/null | grep -v "INFO" | tail -n 1)
> SCALA_VERSION=$("$MVN" --force help:evaluate
> -Dexpression=scala.binary.version $@ 2>/dev/null\
> | grep -v "INFO"\
> | tail -n 1)
> SPARK_HADOOP_VERSION=$("$MVN" --force help:evaluate
> -Dexpression=hadoop.version $@ 2>/dev/null\
> | grep -v "INFO"\
> | tail -n 1)
> SPARK_HIVE=$("$MVN" --force help:evaluate
> -Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null\
> | grep -v "INFO"\
> | fgrep --count "hive";\
> # Reset exit status to 0, otherwise the script stops here if the last
> grep finds nothing\
> # because we use "set -o pipefail"
> echo -n)
>
> line 170:
> BUILD_COMMAND=("$MVN" --force clean package -DskipTests $@)
>
> That forces Spark to use build/mvn and solves the problem.
>
> BUG 2:
>
> When I run the unit test VersionsSuite, it hangs for a night or more. I
> used jstack and lsof and found that it tries to send an HTTP request. That
> does not seem like a good idea when running tests on an unreliable network.
>
> I used jstack and finally found the reason:
>
>   java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:152)
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> - locked <0x0007440224d8> (a java.io.BufferedInputStream)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
> at
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
> - locked <0x000744022530> (a
> sun.net.www.protocol.http.HttpURLConnection)
> at
> java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
> ...
>
> and I use lsof:
>
> java32082 user 247u  IPv4 527001934   TCP 8.8.8.8:33233 (LISTEN)
> java32082 user  267u  IPv4 527001979   TCP 8.8.8.8:52301 (LISTEN)
> java32082 user  316u  IPv4 527001999   TCP *:51993 (LISTEN)
> java32082 user  521u  IPv4 527111590   TCP 8.8.8.8:53286
> ->butan141.server4you.de:http (ESTABLISHED)
>
> This test suite tries to connect to butan141.server4you.de. The process
> will hang when the network is unreliable.
>
>
>


Re: executor delay in Spark

2016-04-28 Thread Raghava Mutharaju
Hello Mike,

No problem, logs are useful to us anyway. Thank you for all the pointers.
We started off with examining only a single RDD but later on added a few
more. The persist-count-unpersist-count sequence is the dummy stage
that you suggested we use to avoid the initial scheduler delay.
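
For context, a minimal sketch (not from the thread) of the kind of dummy warm-up
stage being referred to, assuming an existing SparkContext sc: a cheap job run once
up front so that executor registration and first-stage scheduling overhead does not
distort the measurements that follow.

// Hypothetical warm-up stage: touch every executor once before the timed work.
val warmUp = sc.parallelize(1 to sc.defaultParallelism * 100, sc.defaultParallelism)
warmUp.map(_ + 1).count()   // forces one task per partition; the result is discarded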

Our issue is very similar to the one you posted:
http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Partitions-not-distributed-evenly-to-executors-td16988.html.
We tried with spark.shuffle.reduceLocality.enabled=false and it helps in
certain cases. Were you able to fix that issue? We use Spark 1.6.0.

We noticed the following:
1) Persisting an RDD seems to lead to an unbalanced distribution of partitions
across the executors.
2) If one RDD has an all-nothing skew, then the rest of the RDDs that depend on
it also get the all-nothing skew.

Regards,
Raghava.


On Wed, Apr 27, 2016 at 10:20 AM, Mike Hynes <91m...@gmail.com> wrote:

> Hi Raghava,
>
> I'm terribly sorry about the end of my last email; that garbled
> sentence was garbled because it wasn't meant to exist; I wrote it on
> my phone, realized I wouldn't realistically have time to look into
> another set of logs deeply enough, and then mistook myself for having
> deleted it. Again, I'm very sorry for my error here.
>
> I did peek at your code, though, and think you could try the following:
> 0. The actions in your main method are many, and it will be hard to
> isolate a problem; I would recommend only examining *one* RDD at first,
> rather than six.
> 1. There is a lot of repetition for reading RDDs from textfiles
> sequentially; if you put those lines into two methods depending on RDD
> type, you will at least have one entry point to work with once you
> make a simplified test program.
> 2. In one part you persist, count, immediately unpersist, and then
> count the RDD again. I'm not acquainted with this idiom, and I don't
> understand what it is meant to achieve. It strikes me as suspect for
> triggering unusual garbage collection, which would, I think, only
> complicate your trace debugging.
>
> I've attached a python script that dumps relevant info from the Spark
> JSON logs into a CSV for easier analysis in your language of choice;
> hopefully it can aid in finer grained debugging (the headers of the
> fields it prints are listed in one of the functions).
>
> Mike
>
> On 4/25/16, Raghava Mutharaju  wrote:
> > Mike,
> >
> > We ran our program with 16, 32 and 64 partitions. The behavior was same
> as
> > before with 8 partitions. It was mixed -- for some RDDs we see an
> > all-nothing skew, but for others we see them getting split across the 2
> > worker nodes. In some cases, they start with even split and in later
> > iterations it goes to all-nothing split. Please find the logs attached.
> >
> > our program source code:
> >
> https://github.com/raghavam/sparkel/blob/275ecbb901a58592d8a70a8568dd95c839d46ecc/src/main/scala/org/daselab/sparkel/SparkELDAGAnalysis.scala
> >
> > We put in persist() statements for different RDDs just to check their
> skew.
> >
> > @Jeff, setting minRegisteredResourcesRatio did not help. Behavior was
> same
> > as before.
> >
> > Thank you for your time.
> >
> > Regards,
> > Raghava.
> >
> >
> > On Sun, Apr 24, 2016 at 7:17 PM, Mike Hynes <91m...@gmail.com> wrote:
> >
> >> Could you change numPartitions to {16, 32, 64} and run your program for
> >> each to see how many partitions are allocated to each worker? Let's see
> >> if
> >> you experience an all-nothing imbalance that way; if so, my guess is
> that
> >> something else is odd in your program logic or spark runtime
> environment,
> >> but if not and your executors all receive at least *some* partitions,
> >> then
> >> I still wouldn't rule out effects of scheduling delay. It's a simple
> >> test,
> >> but it could give some insight.
> >>
> >> Mike
> >>
> >> his could still be a  scheduling  If only one has *all* partitions,  and
> >> email me the log file? (If it's 10+ MB, just the first few thousand
> lines
> >> are fine).
> >> On Apr 25, 2016 7:07 AM, "Raghava Mutharaju"  >
> >> wrote:
> >>
> >>> Mike, All,
> >>>
> >>> It turns out that the second time we encountered the uneven-partition
> >>> issue was not due to spark-submit. It was resolved with the change in
> >>> placement of count().
> >>>
> >>> Case-1:
> >>>
> >>> val numPartitions = 8
> >>> // read uAxioms from HDFS, use hash partitioner on it and persist it
> >>> // read type1Axioms from HDFS, use hash partitioner on it and persist
> it
> >>> currDeltaURule1 = type1Axioms.join(uAxioms)
> >>>  .values
> >>>  .distinct(numPartitions)
> >>>  .partitionBy(hashPartitioner)
> >>> currDeltaURule1 = currDeltaURule1.setName("deltaURule1_" + loopCounter)
> >>>
> >>>  .persist(StorageLevel.MEMORY_AND_DISK)
> >>>.count()
> >>>
> >>> 
> >>>
> >>> 

[2 BUG REPORTS] make-distribution.sh fails when an older maven version is installed on the system, and the VersionsSuite test hangs

2016-04-28 Thread Demon King
BUG 1:
I have maven 3.0.2 installed on the system. When I use make-distribution.sh,
it does not seem to use the bundled maven 3.2.2 but instead uses
/usr/local/bin/mvn to build Spark. So I added the --force option in
make-distribution.sh like this:

line 130:
VERSION=$("$MVN" --force help:evaluate -Dexpression=project.version $@
2>/dev/null | grep -v "INFO" | tail -n 1)
SCALA_VERSION=$("$MVN" --force help:evaluate
-Dexpression=scala.binary.version $@ 2>/dev/null\
| grep -v "INFO"\
| tail -n 1)
SPARK_HADOOP_VERSION=$("$MVN" --force help:evaluate
-Dexpression=hadoop.version $@ 2>/dev/null\
| grep -v "INFO"\
| tail -n 1)
SPARK_HIVE=$("$MVN" --force help:evaluate
-Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null\
| grep -v "INFO"\
| fgrep --count "hive";\
# Reset exit status to 0, otherwise the script stops here if the last
grep finds nothing\
# because we use "set -o pipefail"
echo -n)

line 170:
BUILD_COMMAND=("$MVN" --force clean package -DskipTests $@)

That forces Spark to use build/mvn and solves the problem.

BUG 2:

When I run the unit test VersionsSuite, it hangs for a night or more. I
used jstack and lsof and found that it tries to send an HTTP request. That
does not seem like a good idea when running tests on an unreliable network.

I used jstack and finally found the reason:

  java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
- locked <0x0007440224d8> (a java.io.BufferedInputStream)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
at
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
- locked <0x000744022530> (a
sun.net.www.protocol.http.HttpURLConnection)
at
java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
...

and I use lsof:

java32082 user 247u  IPv4 527001934   TCP 8.8.8.8:33233 (LISTEN)
java32082 user  267u  IPv4 527001979   TCP 8.8.8.8:52301 (LISTEN)
java32082 user  316u  IPv4 527001999   TCP *:51993 (LISTEN)
java32082 user  521u  IPv4 527111590   TCP 8.8.8.8:53286
->butan141.server4you.de:http (ESTABLISHED)

This test suite tries to connect to butan141.server4you.de. The process
will hang when the network is unreliable.


Re: Spark on AWS

2016-04-28 Thread Fatma Ozcan
Thanks for the responses.
Fatma
On Apr 28, 2016 3:00 PM, "Renato Perini"  wrote:

> I have setup a small development cluster using t2.micro machines and an
> Amazon Linux AMI (CentOS 6.x).
> The whole setup has been done manually, without using the provided
> scripts. The whole setup is composed of a total of 5 instances: the first
> machine has an elastic IP and it is used as a bridge to access the other 4
> machines (they don't have elastic IPs). The second machine runs a
> standalone single node Spark cluster (1 master, 1 worker). The other 3
> machines are configured as an Apache Cassandra cluster. I have tuned the
> JVM and lots of parameters. I do not use S3 nor HDFS, I just write data
> using Spark Streaming (from an Apache Flume sink) to the 3 Cassandra nodes,
> that I use for data retrieval. Data is then processed through regular Spark
> jobs.
> The jobs are submitted to the cluster using LinkedIn Azkaban, executing
> custom shell scripts written by me for wrapping the submitting process and
> handling eventual command line arguments, at scheduled intervals. Results
> are written directly to other Cassandra tables or in a specific folder on
> the filesystem using the regular CSV format.
> The system is completely autonomous and requires little to no manual
> administration.
>
> I'm quite satisfied with it, considering how small and limited the
> machines involved are. But it required lots of tuning work, because we are
> clearly under the recommended requirements. 4 of the 5 machines are
> switched off during the night, only the bridge machine is alive 24/7.
>
> 12$ per month in total.
>
> Renato Perini.
>
>
> On 28/04/2016 23:39, Fatma Ozcan wrote:
>
>> What is your experience using Spark on AWS? Are you setting up your own
>> Spark cluster, and using HDFS? Or are you using Spark as a service from
>> AWS? In the latter case, what is your experience of using S3 directly,
>> without having HDFS in between?
>>
>> Thanks,
>> Fatma
>>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark support for Complex Event Processing (CEP)

2016-04-28 Thread Michael Segel
Look, you said that you didn’t have continuous data, and you do have continuous
data. I just used an analog signal, which can be converted so that you end up
with contiguous digital sampling.

The point is that you have to consider that micro batches are still batched and 
you’re adding latency. 
Even at 500ms, if you’re dealing with a high velocity low latency stream, that 
delay can kill you. 

Time is relative.  Which is why Spark Streaming isn’t good enough for *all* 
streaming. It wasn’t designed to be a contiguous stream. 
And by contiguous I mean that at any given time, there will be an inbound 
message in the queue. 


> On Apr 28, 2016, at 3:57 PM, Mich Talebzadeh  
> wrote:
> 
> Also the point about
> 
> "First there is this thing called analog signal processing…. Is that 
> continuous enough for you? "
> 
> I agree  that analog signal processing like a sine wave,  an AM radio signal 
> – is truly continuous. However,  here we are talking about digital data which 
> will always be sent as bytes and typically with bytes grouped into messages . 
> In other words when we are sending data it is never truly continuous.  We are 
> sending discrete messages.
> 
> HTH,
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 28 April 2016 at 17:22, Mich Talebzadeh  > wrote:
> In a commercial (C)EP like, say, StreamBase, or for example its competitor 
> Apama, the arrival of an input event **immediately** triggers further 
> downstream processing.
> 
> This is admittedly an asynchronous approach, not a synchronous clock-driven 
> micro-batch approach like Spark's.
> 
> I suppose if one wants to split hairs / be philosophical, the clock rate of 
> the microprocessor chip underlies everything.  But I don't think that is 
> quite the point.
> 
> The point is that an asynchronous event-driven approach is as continuous / 
> immediate as **the underlying computer hardware will ever allow.**. It is not 
> limited by an architectural software clock.
> 
> So it is asynchronous vs synchronous that is the key issue, not just the 
> exact speed of the software clock in the synchronous approach.
> 
> It is also indeed true that latencies down to the single-digit microsecond 
> level can sometimes matter in financial trading but rarely.
> HTH
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 28 April 2016 at 12:44, Michael Segel  > wrote:
> I don’t.
> 
> I believe that there have been a  couple of hack-a-thons like one done in 
> Chicago a few years back using public transportation data.
> 
> The first question is what sort of data do you get from the city? 
> 
> I mean it could be as simple as time_stamp, bus_id, route and GPS (x,y).   Or 
> they could provide more information. Like last stop, distance to next stop, 
> avg current velocity… 
> 
> Then there is the frequency of the updates. Every second? Every 3 seconds? 5 
> or 6 seconds…
> 
> This will determine how much work you have to do. 
> 
> Maybe they provide the routes of the busses via a different API call since 
> it's relatively static.
> 
> This will drive your solution more than the underlying technology. 
> 
> Oh and while I focused on buses, there are also rail and other modes of public 
> transportation like light rail, trains, etc … 
> 
> HTH
> 
> -Mike
> 
> 
>> On Apr 28, 2016, at 4:10 AM, Esa Heikkinen > > wrote:
>> 
>> 
>> Do you know any good examples how to use Spark streaming in tracking public 
>> transportation systems ?
>> 
>> Or Storm or some other tool example ?
>> 
>> Regards
>> Esa Heikkinen
>> 
>> 28.4.2016, 3:16, Michael Segel kirjoitti:
>>> Uhm… 
>>> I think you need to clarify a couple of things…
>>> 
>>> First there is this thing called analog signal processing…. Is that 
>>> continuous enough for you? 
>>> 
>>> But more to the point, Spark Streaming does micro batching so if you’re 
>>> processing a continuous stream of tick data, you will have more than 50K of 
>>> ticks per second while there are markets open and trading.  Even at 50K a 
>>> second, that would mean 1 every .02 ms or 50 ticks a ms. 
>>> 
>>> And you don’t want to wait until you have a batch to start processing, but 
>>> you want to process when the data hits the queue and pull it from the queue 
>>> as quickly as possible. 
>>> 
>>> Spark 

Re: Spark on AWS

2016-04-28 Thread Renato Perini
I have setup a small development cluster using t2.micro machines and an 
Amazon Linux AMI (CentOS 6.x).
The whole setup has been done manually, without using the provided 
scripts. The whole setup is composed of a total of 5 instances: the 
first machine has an elastic IP and it is used as a bridge to access the 
other 4 machines (they don't have elastic IPs). The second machine runs 
a standalone single node Spark cluster (1 master, 1 worker). The other 3 
machines are configured as an Apache Cassandra cluster. I have tuned the 
JVM and lots of parameters. I do not use S3 nor HDFS, I just write data 
using Spark Streaming (from an Apache Flume sink) to the 3 Cassandra 
nodes, that I use for data retrieval. Data is then processed through 
regular Spark jobs.
The jobs are submitted to the cluster using LinkedIn Azkaban, executing 
custom shell scripts written by me for wrapping the submitting process 
and handling eventual command line arguments, at scheduled intervals. 
Results are written directly to other Cassandra tables or in a specific 
folder on the filesystem using the regular CSV format.
The system is completely autonomous and requires little to no manual 
administration.


I'm quite satisfied with it, considering how small and limited the 
machines involved are. But it required lots of tuning work, because we 
are clearly under the recommended requirements. 4 of the 5 machines are 
switched off during the night, only the bridge machine is alive 24/7.


12$ per month in total.

Renato Perini.


On 28/04/2016 23:39, Fatma Ozcan wrote:
What is your experience using Spark on AWS? Are you setting up your 
own Spark cluster, and using HDFS? Or are you using Spark as a service 
from AWS? In the latter case, what is your experience of using S3 
directly, without having HDFS in between?


Thanks,
Fatma



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark on AWS

2016-04-28 Thread Alexander Pivovarov
Fatma, the easiest way to create a Spark cluster on AWS is to create an EMR
cluster and select the Spark application (the latest EMR includes Spark 1.6.1).

Spark works well with S3 (read and write). However, it's recommended to
set spark.speculation to true (it's expected that some tasks will fail if you read
a large S3 folder, so speculation should help).
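
A small sketch (not part of the original reply) of where that setting could go when
building the context programmatically; it can equally be passed as
--conf spark.speculation=true to spark-submit. The bucket, prefix, and URI scheme
below are hypothetical and depend on the Hadoop/EMR S3 connector in use.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("s3-read-job")
  .set("spark.speculation", "true")   // re-launch slow tasks, as suggested above
val sc = new SparkContext(conf)

val lines = sc.textFile("s3n://some-bucket/some/large/prefix/")
println(lines.count())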



On Thu, Apr 28, 2016 at 2:39 PM, Fatma Ozcan  wrote:

> What is your experience using Spark on AWS? Are you setting up your own
> Spark cluster, and using HDFS? Or are you using Spark as a service from
> AWS? In the latter case, what is your experience of using S3 directly,
> without having HDFS in between?
>
> Thanks,
> Fatma
>


Spark on AWS

2016-04-28 Thread Fatma Ozcan
What is your experience using Spark on AWS? Are you setting up your own
Spark cluster, and using HDFS? Or are you using Spark as a service from
AWS? In the latter case, what is your experience of using S3 directly,
without having HDFS in between?

Thanks,
Fatma


Re: Spark 2.0+ Structured Streaming

2016-04-28 Thread Tathagata Das
Hello Benjamin,

Have you taken a look at the slides of my talk at Strata San Jose?
http://www.slideshare.net/databricks/taking-spark-streaming-to-the-next-level-with-datasets-and-dataframes
Unfortunately there is no video, as Strata does not upload videos for
everyone.
I presented the same talk at Kafka Summit a couple of days back; that will
probably get uploaded soon.
Let me know if you have any more questions.

On Thu, Apr 28, 2016 at 5:19 AM, Benjamin Kim  wrote:

> Can someone explain to me how the new Structured Streaming works in the
> upcoming Spark 2.0+? I’m a little hazy on how data will be stored and
> referenced if it can be queried and/or batch processed directly from
> streams, and whether the data will be append-only or there will be some sort
> of upsert capability available. This almost sounds similar to what AWS
> Kinesis is trying to achieve, but it can only store the data for 24 hours.
> Am I close?
>
> Thanks,
> Ben
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark support for Complex Event Processing (CEP)

2016-04-28 Thread Mich Talebzadeh
Also the point about

"First there is this thing called analog signal processing…. Is that
continuous enough for you? "

I agree that analog signal processing, like a sine wave or an AM radio
signal, is truly continuous. However, here we are talking about digital
data, which will always be sent as bytes and typically with bytes grouped
into messages. In other words, when we are sending data it is never truly
continuous. We are sending discrete messages.


HTH,



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 28 April 2016 at 17:22, Mich Talebzadeh 
wrote:

> In a commercial (C)EP like, say, StreamBase, or for example its competitor
> Apama, the arrival of an input event **immediately** triggers further
> downstream processing.
>
> This is admittedly an asynchronous approach, not a synchronous clock-driven
> micro-batch approach like Spark's.
>
> I suppose if one wants to split hairs / be philosophical, the clock rate
> of the microprocessor chip underlies everything.  But I don't think that
> is quite the point.
>
> The point is that an asynchronous event-driven approach is as continuous /
> immediate as **the underlying computer hardware will ever allow.**. It
> is not limited by an architectural software clock.
>
> So it is asynchronous vs synchronous that is the key issue, not just the
> exact speed of the software clock in the synchronous approach.
>
> It is also indeed true that latencies down to the single-digit microsecond
> level can sometimes matter in financial trading but rarely.
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 28 April 2016 at 12:44, Michael Segel 
> wrote:
>
>> I don’t.
>>
>> I believe that there have been a  couple of hack-a-thons like one done in
>> Chicago a few years back using public transportation data.
>>
>> The first question is what sort of data do you get from the city?
>>
>> I mean it could be as simple as time_stamp, bus_id, route and GPS (x,y).
>>   Or they could provide more information. Like last stop, distance to next
>> stop, avg current velocity…
>>
>> Then there is the frequency of the updates. Every second? Every 3
>> seconds? 5 or 6 seconds…
>>
>> This will determine how much work you have to do.
>>
>> Maybe they provide the routes of the busses via a different API call
>> since it's relatively static.
>>
>> This will drive your solution more than the underlying technology.
>>
>> Oh and while I focused on buses, there are also rail and other modes of
>> public transportation like light rail, trains, etc …
>>
>> HTH
>>
>> -Mike
>>
>>
>> On Apr 28, 2016, at 4:10 AM, Esa Heikkinen 
>> wrote:
>>
>>
>> Do you know any good examples how to use Spark streaming in tracking
>> public transportation systems ?
>>
>> Or Storm or some other tool example ?
>>
>> Regards
>> Esa Heikkinen
>>
>> On 28.4.2016 at 3:16, Michael Segel wrote:
>>
>> Uhm…
>> I think you need to clarify a couple of things…
>>
>> First there is this thing called analog signal processing…. Is that
>> continuous enough for you?
>>
>> But more to the point, Spark Streaming does micro batching so if you’re
>> processing a continuous stream of tick data, you will have more than 50K of
>> ticks per second while there are markets open and trading.  Even at 50K a
>> second, that would mean 1 every .02 ms or 50 ticks a ms.
>>
>> And you don’t want to wait until you have a batch to start processing,
>> but you want to process when the data hits the queue and pull it from the
>> queue as quickly as possible.
>>
>> Spark streaming will be able to pull batches in as little as 500ms. So if
>> you pull a batch at t0 and immediately have a tick in your queue, you won’t
>> process that data until t0+500ms. And said batch would contain 25,000
>> entries.
>>
>> Depending on what you are doing… that 500ms delay can be enough to be
>> fatal to your trading process.
>>
>> If you don’t like stock data, there are other examples mainly when
>> pulling data from real time embedded systems.
>>
>>
>> If you go back and read what I said, if your data flow is >> (much
>> slower) than 500ms, and / or the time to process is >> 500ms ( much longer
>> )  you could use spark streaming.  If not… and there are applications which
>> require that type of speed…  then you shouldn’t use spark streaming.
>>
>> If you do have that constraint, then you can look at systems like
>> storm/flink/samza / whatever where you have a continuous queue and listener
>> and no micro batch delays.
>> Then for each bolt (storm) you can have a spark context for 

Re: Spark 2.0 Release Date

2016-04-28 Thread Jacek Laskowski
Hi Arun,

My bet is...https://spark-summit.org/2016 :)

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Thu, Apr 28, 2016 at 1:43 PM, Arun Patel  wrote:
> A small request.
>
> Would you mind providing an approximate date of Spark 2.0 release?  Is it
> early May or Mid May or End of May?
>
> Thanks,
> Arun

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Hadoop Context

2016-04-28 Thread Eric Friedman
Hello,

Where in the Spark APIs can I get access to the Hadoop Context instance?  I
am trying to implement the Spark equivalent of this

public void reduce(Text key, Iterable values, Context context)
    throws IOException, InterruptedException {
  if (record == null) {
    throw new IOException("No output record found");
  }
  record.set("a", 125);
  record.set("b", true);
  record.set("c", 'c');
  record.set("d", new java.sql.Date(Calendar.getInstance().getTimeInMillis()));
  record.set("f", 234.526);
  record.set("t", new java.sql.Timestamp(Calendar.getInstance().getTimeInMillis()));
  record.set("v", "foobar string");
  record.set("z", new byte[10]);
  context.write(new Text("mrtarget"), record);
}


where record is a VerticaRecord.
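
For reference (not an answer given in the thread): Spark tasks are not handed a
Hadoop reducer Context. The closest per-task handle is org.apache.spark.TaskContext,
and output is normally produced through RDD/DataFrame writers or the saveAs*HadoopFile
methods rather than context.write(). A minimal sketch, assuming the spark-shell's sc:

import org.apache.spark.TaskContext

// Runs one closure per partition and prints the task metadata that a Hadoop
// Context would otherwise expose (stage, partition, attempt number).
sc.parallelize(1 to 4, 2).foreachPartition { _ =>
  val ctx = TaskContext.get()
  println(s"stage=${ctx.stageId()} partition=${ctx.partitionId()} attempt=${ctx.attemptNumber()}")
}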


Re: Adding a new column to a dataframe (based on value of existing column)

2016-04-28 Thread Marco Mistroni
many thx Nick

kr

On Thu, Apr 28, 2016 at 8:07 PM, Nick Pentreath 
wrote:

> This should work:
>
> scala> val df = Seq((25.0, "foo"), (30.0, "bar")).toDF("age", "name")
> scala> df.withColumn("AgeInt", when(col("age") > 29.0,
> 1).otherwise(0)).show
> +----+----+------+
> | age|name|AgeInt|
> +----+----+------+
> |25.0| foo|     0|
> |30.0| bar|     1|
> +----+----+------+
>
> On Thu, 28 Apr 2016 at 20:45 Marco Mistroni  wrote:
>
>> Hi all,
>>  I have a DataFrame with a column ("Age", type double) and I am trying to
>> create a new
>> column based on the value of the Age column, using the Scala API.
>>
>> this code keeps on complaining
>>
>> scala> df.withColumn("AgeInt", if (df("Age") > 29.0) lit(1) else lit(0))
>> :28: error: type mismatch;
>>  found   : org.apache.spark.sql.Column
>>  required: Boolean
>>   df.withColumn("AgeInt", if (df("Age") > 29.0) lit(1) else
>> lit(0))
>>
>> any suggestions?
>>
>> kind regards,
>>  marco
>>
>


Re: Adding a new column to a dataframe (based on value of existing column)

2016-04-28 Thread Nick Pentreath
This should work:

scala> val df = Seq((25.0, "foo"), (30.0, "bar")).toDF("age", "name")
scala> df.withColumn("AgeInt", when(col("age") > 29.0, 1).otherwise(0)).show
+----+----+------+
| age|name|AgeInt|
+----+----+------+
|25.0| foo|     0|
|30.0| bar|     1|
+----+----+------+
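
One small addition (not in the original reply): when, col, and lit come from
org.apache.spark.sql.functions, so the snippet above assumes that import is in
scope. A self-contained variant, assuming the spark-shell's sqlContext:

import org.apache.spark.sql.functions.{col, lit, when}
import sqlContext.implicits._

val people = Seq((25.0, "foo"), (30.0, "bar")).toDF("age", "name")
val withFlag = people.withColumn("AgeInt", when(col("age") > 29.0, lit(1)).otherwise(lit(0)))
withFlag.show()

The original error came from using a plain Scala if on a Column: df("Age") > 29.0 is a
Column expression, not a Boolean, which is exactly the case when/otherwise handles.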

On Thu, 28 Apr 2016 at 20:45 Marco Mistroni  wrote:

> Hi all,
>  I have a DataFrame with a column ("Age", type double) and I am trying to
> create a new
> column based on the value of the Age column, using the Scala API.
>
> this code keeps on complaining
>
> scala> df.withColumn("AgeInt", if (df("Age") > 29.0) lit(1) else lit(0))
> :28: error: type mismatch;
>  found   : org.apache.spark.sql.Column
>  required: Boolean
>   df.withColumn("AgeInt", if (df("Age") > 29.0) lit(1) else
> lit(0))
>
> any suggestions?
>
> kind regards,
>  marco
>


Adding a new column to a dataframe (based on value of existing column)

2016-04-28 Thread Marco Mistroni
Hi all,
 I have a DataFrame with a column ("Age", type double) and I am trying to
create a new
column based on the value of the Age column, using the Scala API.

this code keeps on complaining

scala> df.withColumn("AgeInt", if (df("Age") > 29.0) lit(1) else lit(0))
:28: error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: Boolean
  df.withColumn("AgeInt", if (df("Age") > 29.0) lit(1) else
lit(0))

any suggestions?

kind regards,
 marco


aggregateByKey - external combine function

2016-04-28 Thread Nirav Patel
Hi,

I tried to convert a groupByKey operation to aggregateByKey in the hope of
avoiding memory and high GC issues when dealing with 200GB of data.
I need to create a collection of resulting key-value pairs which
represents all combinations of a given key.

My merge fun definition is as follows:

private def processDataMerge(map1: collection.mutable.Map[String,
UserDataSet],
  map2:
collection.mutable.Map[String, UserDataSet])
: collection.mutable.Map[String, UserDataSet] = {

// pseudo code

map1 + map2
(Set[combEle1], Set[combEle2] ... ) = map1.map(...extract all elements here)
comb1 = cominatorics(Set[CombELe1])
..
totalcombinations = comb1 + comb2 + ..

map1 + totalcombinations.map(comb => (comb -> UserDataSet))

}


The output of one merge (or seq) is basically the combinations of the input
collection's elements, and so on. So finally you get all combinations for a given
key.

It performs worse using aggregateByKey than groupByKey with the same
configuration. GroupByKey used to halt at the last 9 partitions out of 4000.
This one halts even earlier (halting due to high GC). I kill the job
after it halts for hours on the same task.

I give 25GB of executor memory and 4GB of overhead. My cluster can't allocate
more than 32GB per executor.

I thought of custom partitioning (salting) my keys so there is less data per key and
hence fewer combinations. That would help with data skew, but wouldn't it come to
the same thing in the end? At some point it will need to merge the
key-values spread across the different salts, and it will hit the memory issue at
that point!

Any pointers to resolve this? Perhaps an external merge?

Thanks
Nirav
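
For reference, a rough sketch (not from the thread) of the salting idea raised above,
simplified to String keys and Set[String] values rather than the UserDataSet maps in
the original job: spread each key over N sub-keys, aggregate the smaller groups, then
strip the salt and merge once more.

import scala.util.Random
import org.apache.spark.rdd.RDD

// Hypothetical two-stage ("salted") aggregation to spread a skewed key's data
// across several tasks before the final per-key merge.
def saltedAggregate(pairs: RDD[(String, String)], salts: Int): RDD[(String, Set[String])] = {
  val salted = pairs.map { case (k, v) => ((k, Random.nextInt(salts)), v) }

  // Stage 1: aggregate per (key, salt); each group holds roughly 1/salts of the key's values.
  val partial = salted.aggregateByKey(Set.empty[String])(_ + _, _ ++ _)

  // Stage 2: drop the salt and merge the partial sets per original key.
  partial.map { case ((k, _), values) => (k, values) }
         .reduceByKey(_ ++ _)
}

As the mail already suspects, the final reduceByKey still has to materialize the whole
result for a key on one task, so salting mainly relieves the intermediate shuffle and GC
pressure rather than the final per-key memory footprint.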



Thanks

-- 





Re: How does .jsonFile() work?

2016-04-28 Thread Aliaksandr Bedrytski
If your question is about how the schema is inferred for JSON,
paragraph 5.1 of this paper
https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf

explains it quite well (long story short, Spark tries to find
the most specific type for the field, otherwise it is a string)
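
A tiny illustration (not from the reply) of that widening behaviour, assuming a
Spark 1.x shell where sc and sqlContext are available; when records disagree on a
field's type, the inferred type falls back towards the more general one:

val jsonLines = sc.parallelize(Seq(
  """{"id": 1,     "score": 2}""",
  """{"id": "two", "score": 3.5}"""
))
val df = sqlContext.read.json(jsonLines)
df.printSchema()
// id is inferred as string (the two records disagree), score as double (int widened)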

On Thu, Apr 28, 2016 at 5:53 PM harjitdotsingh 
wrote:

> From what I know and what I have played with, jsonFile reads JSON records
> which are defined as one record per line. It's not always the case that you
> can supply the data that way. If you have custom JSON data where you
> cannot define a record per line, you will have to write your own
> custom receiver to receive the data and then parse it. I hope it makes
> sense.
> I wrote my own handler to read a directory that contained JSON
> files; I read until I hit the EOF and then call the store method,
> which sends the data to the driver.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-does-jsonFile-work-tp26802p26844.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark support for Complex Event Processing (CEP)

2016-04-28 Thread Mich Talebzadeh
In a commercial (C)EP like, say, StreamBase, or for example its competitor
Apama, the arrival of an input event **immediately** triggers further
downstream processing.

This is admittedly an asynchronous approach, not a synchronous clock-driven
micro-batch approach like Spark's.

I suppose if one wants to split hairs / be philosophical, the clock rate of
the microprocessor chip underlies everything.  But I don't think that
is quite the point.

The point is that an asynchronous event-driven approach is as continuous /
immediate as **the underlying computer hardware will ever allow.**. It
is not limited by an architectural software clock.

So it is asynchronous vs synchronous that is the key issue, not just the
exact speed of the software clock in the synchronous approach.

It is also indeed true that latencies down to the single-digit microsecond
level can sometimes matter in financial trading but rarely.

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 28 April 2016 at 12:44, Michael Segel  wrote:

> I don’t.
>
> I believe that there have been a  couple of hack-a-thons like one done in
> Chicago a few years back using public transportation data.
>
> The first question is what sort of data do you get from the city?
>
> I mean it could be as simple as time_stamp, bus_id, route and GPS (x,y).
> Or they could provide more information. Like last stop, distance to next
> stop, avg current velocity…
>
> Then there is the frequency of the updates. Every second? Every 3 seconds?
> 5 or 6 seconds…
>
> This will determine how much work you have to do.
>
> Maybe they provide the routes of the busses via a different API call since
> it's relatively static.
>
> This will drive your solution more than the underlying technology.
>
> Oh and while I focused on buses, there are also rail and other modes of
> public transportation like light rail, trains, etc …
>
> HTH
>
> -Mike
>
>
> On Apr 28, 2016, at 4:10 AM, Esa Heikkinen 
> wrote:
>
>
> Do you know any good examples how to use Spark streaming in tracking
> public transportation systems ?
>
> Or Storm or some other tool example ?
>
> Regards
> Esa Heikkinen
>
> On 28.4.2016 at 3:16, Michael Segel wrote:
>
> Uhm…
> I think you need to clarify a couple of things…
>
> First there is this thing called analog signal processing…. Is that
> continuous enough for you?
>
> But more to the point, Spark Streaming does micro batching so if you’re
> processing a continuous stream of tick data, you will have more than 50K of
> ticks per second while there are markets open and trading.  Even at 50K a
> second, that would mean 1 every .02 ms or 50 ticks a ms.
>
> And you don’t want to wait until you have a batch to start processing, but
> you want to process when the data hits the queue and pull it from the queue
> as quickly as possible.
>
> Spark streaming will be able to pull batches in as little as 500ms. So if
> you pull a batch at t0 and immediately have a tick in your queue, you won’t
> process that data until t0+500ms. And said batch would contain 25,000
> entries.
>
> Depending on what you are doing… that 500ms delay can be enough to be
> fatal to your trading process.
>
> If you don’t like stock data, there are other examples mainly when pulling
> data from real time embedded systems.
>
>
> If you go back and read what I said, if your data flow is >> (much slower)
> than 500ms, and / or the time to process is >> 500ms ( much longer )  you
> could use spark streaming.  If not… and there are applications which
> require that type of speed…  then you shouldn’t use spark streaming.
>
> If you do have that constraint, then you can look at systems like
> storm/flink/samza / whatever where you have a continuous queue and listener
> and no micro batch delays.
> Then for each bolt (storm) you can have a spark context for processing the
> data. (Depending on what sort of processing you want to do.)
>
> To put this in perspective… if you’re using spark streaming / akka / storm
> /etc to handle real time requests from the web, 500ms added delay can be a
> long time.
>
> Choose the right tool.
>
> For the OP’s problem. Sure Tracking public transportation could be done
> using spark streaming. It could also be done using half a dozen other tools
> because the rate of data generation is much slower than 500ms.
>
> HTH
>
>
> On Apr 27, 2016, at 4:34 PM, Mich Talebzadeh 
> wrote:
>
> couple of things.
>
> There is no such thing as Continuous Data Streaming as there is no such
> thing as Continuous Availability.
>
> There is such a thing as Discrete Data Streaming and High Availability, but
> they only reduce the finite unavailability to a minimum. In terms of business
> needs, 5 SIGMA is good enough and acceptable. Even the 

Re: slow SQL query with cached dataset

2016-04-28 Thread Imran Akbar
Thanks Dr. Mich, Jorn,

It's about 150 million rows in the cached dataset.  How do I tell if it's
spilling to disk?  I didn't really see any logs to that effect.
How do I determine the optimal number of partitions for a given input
dataset?  What's too much?

regards,
imran

On Mon, Apr 25, 2016 at 3:55 PM, Mich Talebzadeh 
wrote:

> Are you sure it is not spilling to disk?
>
> How many rows are cached in your result set -> sqlContext.sql("SELECT *
> FROM raw WHERE (dt_year=2015 OR dt_year=2016)")
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 25 April 2016 at 23:47, Imran Akbar  wrote:
>
>> Hi,
>>
>> I'm running a simple query like this through Spark SQL:
>>
>> sqlContext.sql("SELECT MIN(age) FROM data WHERE country = 'GBR' AND
>> dt_year=2015 AND dt_month BETWEEN 1 AND 11 AND product IN
>> ('cereal')").show()
>>
>> which takes 3 minutes to run against an in-memory cache of 9 GB of data.
>>
>> The data was 100% cached in memory before I ran the query (see screenshot
>> 1).
>> The data was cached like this:
>> data = sqlContext.sql("SELECT * FROM raw WHERE (dt_year=2015 OR
>> dt_year=2016)")
>> data.cache()
>> data.registerTempTable("data")
>> and then I ran an action query to load the data into the cache.
>>
>> I see lots of rows of logs like this:
>> 16/04/25 22:39:11 INFO MemoryStore: Block rdd_13136_2856 stored as values
>> in memory (estimated size 2.5 MB, free 9.7 GB)
>> 16/04/25 22:39:11 INFO BlockManager: Found block rdd_13136_2856 locally
>> 16/04/25 22:39:11 INFO MemoryStore: 6 blocks selected for dropping
>> 16/04/25 22:39:11 INFO BlockManager: Dropping block rdd_13136_3866 from
>> memory
>>
>> Screenshot 2 shows the job page of the longest job.
>>
>> The data was partitioned in Parquet by month, country, and product before
>> I cached it.
>>
>> Any ideas what the issue could be?  This is running on localhost.
>>
>> regards,
>> imran
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
>


Re: slow SQL query with cached dataset

2016-04-28 Thread Mich Talebzadeh
Hi Imran,

" How do I tell if it's spilling to disk?"

Well, that is a very valid question. I do not have a quantitative metric to
state that out of X GB of data in Spark, Y GB has been spilled to
disk because of the volume of data.

Unlike an RDBMS, Spark uses memory as opposed to shared memory. When an RDBMS
hits its memory limit it starts swapping, which one can see with swap -l.

The only way, I believe, one can reason about it is by looking at the parameters
passed to spark-submit:

${SPARK_HOME}/bin/spark-submit \
--packages com.databricks:spark-csv_2.11:1.3.0 \
--jars
/home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar \
--class "${FILE_NAME}" \
--master spark://50.140.197.217:7077 \
--executor-memory=12G \
--executor-cores=12 \
--num-executors=2 \
${JAR_FILE}

So I have not seen a tool that shows the spillage of data quantitatively.
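
One partial option (not mentioned in the thread, assuming Spark 1.5+): the
SparkContext reports, per cached RDD, how much is held in memory versus on disk,
which at least reveals whether a MEMORY_AND_DISK cache has started spilling. It
does not cover shuffle spill.

// Sketch: inspect cached-RDD storage from the driver. getRDDStorageInfo is a
// developer API; the numbers mirror the web UI's Storage tab.
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
          s"memory=${info.memSize} bytes, disk=${info.diskSize} bytes")
}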

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 28 April 2016 at 16:36, Imran Akbar  wrote:

> Thanks Dr. Mich, Jorn,
>
> It's about 150 million rows in the cached dataset.  How do I tell if it's
> spilling to disk?  I didn't really see any logs to that effect.
> How do I determine the optimal number of partitions for a given input
> dataset?  What's too much?
>
> regards,
> imran
>
> On Mon, Apr 25, 2016 at 3:55 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Are you sure it is not spilling to disk?
>>
>> How many rows are cached in your result set -> sqlContext.sql("SELECT *
>> FROM raw WHERE (dt_year=2015 OR dt_year=2016)")
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 25 April 2016 at 23:47, Imran Akbar  wrote:
>>
>>> Hi,
>>>
>>> I'm running a simple query like this through Spark SQL:
>>>
>>> sqlContext.sql("SELECT MIN(age) FROM data WHERE country = 'GBR' AND
>>> dt_year=2015 AND dt_month BETWEEN 1 AND 11 AND product IN
>>> ('cereal')").show()
>>>
>>> which takes 3 minutes to run against an in-memory cache of 9 GB of data.
>>>
>>> The data was 100% cached in memory before I ran the query (see
>>> screenshot 1).
>>> The data was cached like this:
>>> data = sqlContext.sql("SELECT * FROM raw WHERE (dt_year=2015 OR
>>> dt_year=2016)")
>>> data.cache()
>>> data.registerTempTable("data")
>>> and then I ran an action query to load the data into the cache.
>>>
>>> I see lots of rows of logs like this:
>>> 16/04/25 22:39:11 INFO MemoryStore: Block rdd_13136_2856 stored as
>>> values in memory (estimated size 2.5 MB, free 9.7 GB)
>>> 16/04/25 22:39:11 INFO BlockManager: Found block rdd_13136_2856 locally
>>> 16/04/25 22:39:11 INFO MemoryStore: 6 blocks selected for dropping
>>> 16/04/25 22:39:11 INFO BlockManager: Dropping block rdd_13136_3866 from
>>> memory
>>>
>>> Screenshot 2 shows the job page of the longest job.
>>>
>>> The data was partitioned in Parquet by month, country, and product
>>> before I cached it.
>>>
>>> Any ideas what the issue could be?  This is running on localhost.
>>>
>>> regards,
>>> imran
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>
>>
>


Re: How does .jsonFile() work?

2016-04-28 Thread harjitdotsingh
From what I know and what I have played with, jsonFile reads JSON records
which are defined as one record per line. It's not always the case that you
can supply the data that way. If you have custom JSON data where you
cannot define a record per line, you will have to write your own
custom receiver to receive the data and then parse it. I hope it makes sense.
I wrote my own handler to read a directory that contained JSON
files; I read until I hit the EOF and then call the store method,
which sends the data to the driver.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-does-jsonFile-work-tp26802p26844.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



access to nonnegative flag with ALS trainImplicit

2016-04-28 Thread Roberto Pagliari
I'm using ALS with mllib 1.5.2 in Scala.

I do not have access to the nonnegative flag in trainImplicit.

Which API is it available from?
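
For reference, a hedged sketch of one way this is usually reached in mllib 1.5.x:
the builder-style ALS class exposes setNonnegative (and setImplicitPrefs), whereas
the static ALS.trainImplicit helpers do not take that flag. The ratings parameter
below stands for a hypothetical RDD[Rating].

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

// Equivalent of trainImplicit, but with access to the nonnegative flag.
def trainImplicitNonnegative(ratings: RDD[Rating]) =
  new ALS()
    .setImplicitPrefs(true)
    .setNonnegative(true)   // the flag not exposed by the static trainImplicit methods
    .setRank(10)
    .setIterations(10)
    .setAlpha(0.01)
    .run(ratings)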


Re: Could not access Spark webUI on OpenStack VMs

2016-04-28 Thread Ted Yu
What happened when you tried to access port 8080 ?

Checking iptables settings is good to do.

At my employer, we use OpenStack clusters daily and don't encounter many
problems, including with UI access.
Probably some settings should be tuned.

On Thu, Apr 28, 2016 at 5:03 AM, Dan Dong  wrote:

> Hi, all,
>   I'm having a problem accessing the web UI of my Spark cluster. The cluster
> is composed of a few virtual machines running on an OpenStack platform. The
> VMs are launched from the CentOS 7.0 server image available from the official site.
> Spark itself runs well, the master and worker processes are all up and
> running, and running the SparkPi example also gives the expected result. So the question
> is, how do I debug such a problem? Could it be a problem native to the
> CentOS image, as it is not a desktop version, so some graphics packages might
> be missing in the VM?
> Or is it an iptables settings problem coming from OpenStack, as OpenStack
> configures a complex network inside it and might block certain
> communication?
> Has anybody seen similar problems? Any hints will be appreciated. Thanks!
>
> Cheers,
> Dong
>
>


Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-28 Thread Xiangrui Meng
It implements CombineInputFormat from Hadoop. isSplittable=false means each
individual file cannot be split. If you only see one partition even with a
large minPartitions, perhaps the total size of files is not big enough.
Those are configurable in Hadoop conf. -Xiangrui
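
A quick way (not from the reply) to check that in practice, assuming the spark-shell
and a hypothetical HDFS directory: pass a minPartitions hint and look at how many
partitions actually materialize and how the files are packed into them.

// Each element is (filePath, fileContent); files are packed into partitions by
// total size, so many small files can still collapse into a single partition.
val files = sc.wholeTextFiles("hdfs:///some/dir", minPartitions = 8)
println(s"partitions = ${files.partitions.length}")
files.mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
     .collect()
     .foreach { case (i, n) => println(s"partition $i holds $n files") }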

On Tue, Apr 26, 2016, 8:32 AM Hyukjin Kwon  wrote:

> EDIT: not a mapper but maybe a task for HadoopRDD, as far as I know.
>
> I think the most clear way is just to run a job on multiple files with the
> API and check the number of tasks in the job.
> On 27 Apr 2016 12:06 a.m., "Hyukjin Kwon"  wrote:
>
> wholeTextFile() API uses WholeTextFileInputFormat,
> https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala,
> which returns false for isSplittable. In this case, only single mapper
> appears for the entire file as far as I know.
>
> And also https://spark.apache.org/docs/1.6.0/programming-guide.html
>
> If the file is single file, then this would not be distributed.
> On 26 Apr 2016 11:52 p.m., "Ted Yu"  wrote:
>
>> Please take a look at:
>> core/src/main/scala/org/apache/spark/SparkContext.scala
>>
>>* Do `val rdd = sparkContext.wholeTextFile("hdfs://a-hdfs-path")`,
>>*
>>*  then `rdd` contains
>>* {{{
>>*   (a-hdfs-path/part-0, its content)
>>*   (a-hdfs-path/part-1, its content)
>>*   ...
>>*   (a-hdfs-path/part-n, its content)
>>* }}}
>> ...
>>   * @param minPartitions A suggestion value of the minimal splitting
>> number for input data.
>>
>>   def wholeTextFiles(
>>   path: String,
>>   minPartitions: Int = defaultMinPartitions): RDD[(String, String)] =
>> withScope {
>>
>> On Tue, Apr 26, 2016 at 7:43 AM, Vadim Vararu 
>> wrote:
>>
>>> Hi guys,
>>>
>>> I'm trying to read many filed from s3 using
>>> JavaSparkContext.wholeTextFiles(...). Is that executed in a distributed
>>> manner? Please give me a link to the place in documentation where it's
>>> specified.
>>>
>>> Thanks, Vadim.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>


Re: Spark writing to secure zone throws : UnknownCryptoProtocolVersionException

2016-04-28 Thread Ted Yu
Interesting.

The phoenix dependency wasn't shown in the classpath of your previous email.

On Thu, Apr 28, 2016 at 4:12 AM, pierre lacave  wrote:

> Narrowed it down to some version incompatibility with Phoenix 4.7.
>
> Including $SPARK_HOME/lib/phoenix-4.7.0-HBase-1.1-client-spark.jar in
> extraClassPath is what triggers the issue above.
>
> I'll have a go at adding the individual dependencies as opposed to this
> fat jar and see how it goes.
>
> Thanks
>
>
> On Thu, Apr 28, 2016 at 10:52 AM, pierre lacave  wrote:
>
>> Thanks Ted,
>>
>> I am actually using the hadoop free version of spark
>> (spark-1.5.0-bin-without-hadoop) over hadoop 2.6.1, so could very well be
>> related indeed.
>>
>> I have configured spark-env.sh with export
>> SPARK_DIST_CLASSPATH=$($HADOOP_PREFIX/bin/hadoop classpath), which is the
>> only version of hadoop on the system (2.6.1) and able to interface with
>> hdfs (on no secured zones)
>>
>> Interestingly running this in the repl works fine
>>
>> // Create a simple DataFrame, stored into a partition directory
>> val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
>> df1.write.parquet("/securedzone/test")
>>
>>
>> but if packaged as an app and run in local or yarn client/cluster mode,
>> it fails with the error described.
>>
>> I am not including any hadoop specific, so not sure where the difference
>> in DFSClient could come from.
>>
>> [info] Loading project definition from
>> /Users/zoidberg/Documents/demo/x/trunk/src/jobs/project
>> [info] Set current project to root (in build
>> file:/Users/zoidberg/Documents/demo/x/trunk/src/jobs/)
>> [info] Updating
>> {file:/Users/zoidberg/Documents/demo/x/trunk/src/jobs/}common...
>> [info] com.demo.project:root_2.10:0.2.3 [S]
>> [info] com.demo.project:common_2.10:0.2.3 [S]
>> [info]   +-joda-time:joda-time:2.8.2
>> [info]
>> [info] Resolving org.fusesource.jansi#jansi;1.4 ...
>> [info] Done updating.
>> [info] Updating
>> {file:/Users/zoidberg/Documents/demo/x/trunk/src/jobs/}extract...
>> [info] Resolving org.fusesource.jansi#jansi;1.4 ...
>> [info] Done updating.
>> [info] com.demo.project:extract_2.10:0.2.3 [S]
>> [info]   +-com.demo.project:common_2.10:0.2.3 [S]
>> [info]   | +-joda-time:joda-time:2.8.2
>> [info]   |
>> [info]   +-com.databricks:spark-csv_2.10:1.3.0 [S]
>> [info] +-com.univocity:univocity-parsers:1.5.1
>> [info] +-org.apache.commons:commons-csv:1.1
>> [info]
>> [success] Total time: 9 s, completed 28-Apr-2016 10:40:25
>>
>>
>> I am assuming I do not need to rebuild spark to use it with hadoop 2.6.1
>> and that the spark with user provided hadoop would let me do that,
>>
>>
>> $HADOOP_PREFIX/bin/hadoop classpath expends to:
>>
>>
>> /usr/local/project/hadoop/conf:/usr/local/project/hadoop/share/hadoop/common/lib/*:/usr/local/project/hadoop/share/hadoop/common/*:/usr/local/project/hadoop/share/hadoop/hdfs:/usr/local/project/hadoop/share/hadoop/hdfs/lib/*:/usr/local/project/hadoop/share/hadoop/hdfs/*:/usr/local/project/hadoop/share/hadoop/yarn/lib/*:/usr/local/project/hadoop/share/hadoop/yarn/*:/usr/local/project/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/project/hadoop/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar
>>
>> Thanks
>>
>>
>> On Sun, Apr 24, 2016 at 2:20 AM, Ted Yu  wrote:
>>
>>> Can you check that the DFSClient Spark uses is the same version as on
>>> the server side ?
>>>
>>> The client and server (NameNode) negotiate a "crypto protocol version" -
>>> this is a forward-looking feature.
>>> Please note:
>>>
>>> bq. Client provided: []
>>>
>>> Meaning client didn't provide any supported crypto protocol version.
>>>
>>> Cheers
>>>
>>> On Wed, Apr 20, 2016 at 3:27 AM, pierre lacave  wrote:
>>>
 Hi


 I am trying to use spark to write to a protected zone in hdfs, I am able 
 to create and list file using the hdfs client but when writing via Spark I 
 get this exception.

 I could not find any mention of CryptoProtocolVersion in the spark doc.


 Any idea what could have gone wrong?


 spark (1.5.0), hadoop (2.6.1)


 Thanks


 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.UnknownCryptoProtocolVersionException):
  No crypto protocol versions provided by the client are supported. Client 
 provided: [] NameNode supports: 
 [CryptoProtocolVersion{description='Unknown', version=1, 
 unknownValue=null}, CryptoProtocolVersion{description='Encryption zones', 
 version=2, unknownValue=null}]
at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.chooseProtocolVersion(FSNamesystem.java:2468)
at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2600)
at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2520)
at 
 

Re: Spark 2.0 Release Date

2016-04-28 Thread Benjamin Kim
Next Thursday is Databricks' webinar on Spark 2.0. If you are attending, I bet 
many are going to ask when the release will be. Last time they did this, Spark 
1.6 came out not too long afterward.

> On Apr 28, 2016, at 5:21 AM, Sean Owen  wrote:
> 
> I don't know if anyone has begun a firm discussion on dates, but there
> are >100 open issues and ~10 blockers, so still some work to do before
> code freeze, it looks like. My unofficial guess is mid June before
> it's all done.
> 
> On Thu, Apr 28, 2016 at 12:43 PM, Arun Patel  wrote:
>> A small request.
>> 
>> Would you mind providing an approximate date of Spark 2.0 release?  Is it
>> early May or Mid May or End of May?
>> 
>> Thanks,
>> Arun
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Streaming K-means not printing predictions

2016-04-28 Thread Ashutosh Kumar
It is reading the files now but throws another error complaining that the vector
sizes do not match. I saw this error reported on Stack Overflow:

http://stackoverflow.com/questions/30737361/getting-java-lang-illegalargumentexception-requirement-failed-while-calling-spa

Also, the example given in Scala calls model.setRandomCenters with two arguments,
whereas the Java method needs 3?

Any clues ?
Thanks
Ashutosh
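
For reference, a hedged sketch of the relevant call in Scala (Spark 1.5/1.6):
setRandomCenters(dim, weight, seed) has a default seed in Scala, which is why the
Scala example passes two arguments, while Java, which cannot see Scala default
parameters, must pass all three. The dimension given here must also match the length
of every training vector, which is the usual cause of the "requirement failed"
vector-size error.

import org.apache.spark.mllib.clustering.StreamingKMeans

val numDimensions = 3   // hypothetical: must equal the size of each training vector
val model = new StreamingKMeans()
  .setK(2)
  .setDecayFactor(1.0)
  .setRandomCenters(numDimensions, 0.0, 42L)  // seed optional in Scala, required from Java

// model.trainOn(trainingStream) would then consume a DStream[Vector] of that dimension.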


On Wed, Apr 27, 2016 at 9:59 PM, Ashutosh Kumar 
wrote:

> The problem seems to be that streamconxt.textFileStream(path) is not reading
> the file at all. It does not throw any exception either. I tried some tricks
> given in the mailing lists, like copying the file to the specified directory after
> the program starts, touching the file to change its timestamp, etc., but no luck.
>
> Thanks
> Ashutosh
>
>
>
> On Wed, Apr 27, 2016 at 2:43 PM, Niki Pavlopoulou  wrote:
>
>> One of the reasons that happened to me (assuming everything is OK in your
>> streaming process) is if you run it in local mode: instead of local[*], use
>> local[4].
>>
>> On 26 April 2016 at 15:10, Ashutosh Kumar 
>> wrote:
>>
>>> I created a Streaming k means based on scala example. It keeps running
>>> without any error but never prints predictions
>>>
>>> Here is Log
>>>
>>> 19:15:05,050 INFO
>>> org.apache.spark.streaming.scheduler.InputInfoTracker - remove old
>>> batch metadata: 146167824 ms
>>> 19:15:10,001 INFO
>>> org.apache.spark.streaming.dstream.FileInputDStream   - Finding new
>>> files took 1 ms
>>> 19:15:10,001 INFO
>>> org.apache.spark.streaming.dstream.FileInputDStream   - New files
>>> at time 146167831 ms:
>>>
>>> 19:15:10,007 INFO
>>> org.apache.spark.streaming.dstream.FileInputDStream   - Finding new
>>> files took 2 ms
>>> 19:15:10,007 INFO
>>> org.apache.spark.streaming.dstream.FileInputDStream   - New files
>>> at time 146167831 ms:
>>>
>>> 19:15:10,014 INFO
>>> org.apache.spark.streaming.scheduler.JobScheduler - Added jobs
>>> for time 146167831 ms
>>> 19:15:10,015 INFO
>>> org.apache.spark.streaming.scheduler.JobScheduler - Starting
>>> job streaming job 146167831 ms.0 from job set of time 146167831 ms
>>> 19:15:10,028 INFO
>>> org.apache.spark.SparkContext - Starting
>>> job: collect at StreamingKMeans.scala:89
>>> 19:15:10,028 INFO
>>> org.apache.spark.scheduler.DAGScheduler   - Job 292
>>> finished: collect at StreamingKMeans.scala:89, took 0.41 s
>>> 19:15:10,029 INFO
>>> org.apache.spark.streaming.scheduler.JobScheduler - Finished
>>> job streaming job 146167831 ms.0 from job set of time 146167831 ms
>>> 19:15:10,029 INFO
>>> org.apache.spark.streaming.scheduler.JobScheduler - Starting
>>> job streaming job 146167831 ms.1 from job set of time 146167831 ms
>>> ---
>>> Time: 146167831 ms
>>> ---
>>>
>>> 19:15:10,036 INFO
>>> org.apache.spark.streaming.scheduler.JobScheduler - Finished
>>> job streaming job 146167831 ms.1 from job set of time 146167831 ms
>>> 19:15:10,036 INFO
>>> org.apache.spark.rdd.MapPartitionsRDD - Removing
>>> RDD 2912 from persistence list
>>> 19:15:10,037 INFO
>>> org.apache.spark.rdd.MapPartitionsRDD - Removing
>>> RDD 2911 from persistence list
>>> 19:15:10,037 INFO
>>> org.apache.spark.storage.BlockManager - Removing
>>> RDD 2912
>>> 19:15:10,037 INFO
>>> org.apache.spark.streaming.scheduler.JobScheduler - Total
>>> delay: 0.036 s for time 146167831 ms (execution: 0.021 s)
>>> 19:15:10,037 INFO
>>> org.apache.spark.rdd.UnionRDD - Removing
>>> RDD 2800 from persistence list
>>> 19:15:10,037 INFO
>>> org.apache.spark.storage.BlockManager - Removing
>>> RDD 2911
>>> 19:15:10,037 INFO
>>> org.apache.spark.streaming.dstream.FileInputDStream   - Cleared 1
>>> old files that were older than 146167825 ms: 1461678245000 ms
>>> 19:15:10,037 INFO
>>> org.apache.spark.rdd.MapPartitionsRDD - Removing
>>> RDD 2917 from persistence list
>>> 19:15:10,037 INFO
>>> org.apache.spark.storage.BlockManager - Removing
>>> RDD 2800
>>> 19:15:10,037 INFO
>>> org.apache.spark.rdd.MapPartitionsRDD - Removing
>>> RDD 2916 from persistence list
>>> 19:15:10,037 INFO
>>> org.apache.spark.rdd.MapPartitionsRDD - Removing
>>> RDD 2915 from persistence list
>>> 19:15:10,037 INFO
>>> org.apache.spark.rdd.MapPartitionsRDD - Removing
>>> RDD 2914 from persistence list
>>> 19:15:10,037 INFO
>>> org.apache.spark.rdd.UnionRDD - Removing
>>> RDD 2803 from persistence list
>>> 19:15:10,037 INFO
>>> 

Re: Spark 2.0 Release Date

2016-04-28 Thread Sean Owen
I don't know if anyone has begun a firm discussion on dates, but there
are >100 open issues and ~10 blockers, so still some work to do before
code freeze, it looks like. My unofficial guess is mid June before
it's all done.

On Thu, Apr 28, 2016 at 12:43 PM, Arun Patel  wrote:
> A small request.
>
> Would you mind providing an approximate date of Spark 2.0 release?  Is it
> early May or Mid May or End of May?
>
> Thanks,
> Arun

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark 2.0+ Structured Streaming

2016-04-28 Thread Benjamin Kim
Can someone explain to me how the new Structured Streaming works in the 
upcoming Spark 2.0+? I'm a little hazy on how data will be stored and referenced 
if it can be queried and/or batch processed directly from streams, and whether the 
data will be append-only or there will be some sort of upsert capability 
available. This almost sounds similar to what AWS Kinesis is trying to achieve, 
but Kinesis can only store the data for 24 hours. Am I close?

Thanks,
Ben
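
A rough sketch of what the DataFrame-based streaming API ended up looking like in Spark 2.0 (the API was still settling at the time of this thread; paths, schema and column names here are assumptions, and sinks other than the in-memory one below have different storage and retention semantics):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

// an unbounded DataFrame backed by files arriving in a directory;
// the schema is inferred from a static sample directory (an assumption)
val events = spark.readStream
  .schema(spark.read.json("/data/events-sample").schema)
  .json("/data/events-stream")

// a continuously updated aggregate exposed as an in-memory table that can be
// queried with ordinary SQL while the stream runs; "complete" mode rewrites the
// whole result each trigger, while plain projections/filters use "append" mode
val query = events.groupBy("userId").count()
  .writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("event_counts")
  .start()

spark.sql("SELECT * FROM event_counts ORDER BY count DESC").show()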
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Could not access Spark webUI on OpenStack VMs

2016-04-28 Thread Dan Dong
Hi, all,
  I'm having a problem accessing the web UI of my Spark cluster. The cluster
is composed of a few virtual machines running on an OpenStack platform. The
VMs are launched from the CentOS 7.0 server image available from the official site.
Spark itself runs well: the master and worker processes are all up and
running, and running the SparkPi example also gives the expected result. So the question
is, how to debug such a problem? Could it be a problem native to the
CentOS image, since it is not a desktop version and some graphics packages might
be missing in the VM?
Or is it an iptables settings problem coming from OpenStack, since OpenStack
configures a complex network internally and might block certain
communication?
Has anybody seen similar problems? Any hints will be appreciated. Thanks!

Cheers,
Dong


Re: Spark support for Complex Event Processing (CEP)

2016-04-28 Thread Michael Segel
I don’t.

I believe that there have been a  couple of hack-a-thons like one done in 
Chicago a few years back using public transportation data.

The first question is what sort of data do you get from the city? 

I mean it could be as simple as time_stamp, bus_id, route and GPS (x,y).   Or 
they could provide more information. Like last stop, distance to next stop, avg 
current velocity… 

Then there is the frequency of the updates. Every second? Every 3 seconds? 5 or 
6 seconds…

This will determine how much work you have to do. 

Maybe they provide the routes of the busses via a different API call since its 
relatively static.

This will drive your solution more than the underlying technology. 

Oh, and while I focused on buses, there are also rail and other modes of public 
transportation like light rail, trains, etc.

HTH

-Mike


> On Apr 28, 2016, at 4:10 AM, Esa Heikkinen  
> wrote:
> 
> 
> Do you know any good examples of how to use Spark Streaming for tracking public 
> transportation systems?
> 
> Or an example with Storm or some other tool?
> 
> Regards
> Esa Heikkinen
> 
> 28.4.2016, 3:16, Michael Segel kirjoitti:
>> Uhm… 
>> I think you need to clarify a couple of things…
>> 
>> First there is this thing called analog signal processing…. Is that 
>> continuous enough for you? 
>> 
>> But more to the point, Spark Streaming does micro batching so if you’re 
>> processing a continuous stream of tick data, you will have more than 50K of 
>> tics per second while there are markets open and trading.  Even at 50K a 
>> second, that would mean 1 every .02 ms or 50 ticks a ms. 
>> 
>> And you don’t want to wait until you have a batch to start processing, but 
>> you want to process when the data hits the queue and pull it from the queue 
>> as quickly as possible. 
>> 
>> Spark streaming will be able to pull batches in as little as 500ms. So if 
>> you pull a batch at t0 and immediately have a tick in your queue, you won’t 
>> process that data until t0+500ms. And said batch would contain 25,000 
>> entries. 
>> 
>> Depending on what you are doing… that 500ms delay can be enough to be fatal 
>> to your trading process. 
>> 
>> If you don’t like stock data, there are other examples mainly when pulling 
>> data from real time embedded systems. 
>> 
>> 
>> If you go back and read what I said, if your data flow is >> (much slower) 
>> than 500ms, and / or the time to process is >> 500ms ( much longer )  you 
>> could use spark streaming.  If not… and there are applications which require 
>> that type of speed…  then you shouldn’t use spark streaming. 
>> 
>> If you do have that constraint, then you can look at systems like 
>> storm/flink/samza / whatever where you have a continuous queue and listener 
>> and no micro batch delays.
>> Then for each bolt (storm) you can have a spark context for processing the 
>> data. (Depending on what sort of processing you want to do.) 
>> 
>> To put this in perspective… if you’re using spark streaming / akka / storm 
>> /etc to handle real time requests from the web, 500ms added delay can be a 
>> long time. 
>> 
>> Choose the right tool. 
>> 
>> For the OP’s problem. Sure Tracking public transportation could be done 
>> using spark streaming. It could also be done using half a dozen other tools 
>> because the rate of data generation is much slower than 500ms. 
>> 
>> HTH
>> 
>> 
>>> On Apr 27, 2016, at 4:34 PM, Mich Talebzadeh >> > wrote:
>>> 
>>> couple of things.
>>> 
>>> There is no such thing as Continuous Data Streaming as there is no such 
>>> thing as Continuous Availability.
>>> 
>>> There is such thing as Discrete Data Streaming and  High Availability  but 
>>> they reduce the finite unavailability to minimum. In terms of business 
>>> needs a 5 SIGMA is good enough and acceptable. Even the candles set to a 
>>> predefined time interval say 2, 4, 15 seconds overlap. No FX savvy trader 
>>> makes a sell or buy decision on the basis of 2 seconds candlestick
>>> 
>>> The calculation itself in measurements is subject to finite error as 
>>> defined by their Confidence Level (CL) using Standard Deviation function.
>>> 
>>> OK, so far I have never noticed a tool that requires that level of 
>>> granularity. That stuff from Flink etc. is, in practical terms, of little 
>>> value and does not make commercial sense.
>>> 
>>> Now with regard to your needs, Spark micro batching is perfectly adequate.
>>> 
>>> HTH
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn   
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> 
>>>  
>>> http://talebzadehmich.wordpress.com 
>>>  
>>> 
>>> On 27 April 2016 at 22:10, Esa Heikkinen < 
>>> 

Spark 2.0 Release Date

2016-04-28 Thread Arun Patel
A small request.

Would you mind providing an approximate date of Spark 2.0 release?  Is it
early May or Mid May or End of May?

Thanks,
Arun


Re: Spark writing to secure zone throws : UnknownCryptoProtocolVersionException

2016-04-28 Thread pierre lacave
Narrowed it down to some version incompatibility with Phoenix 4.7.

Including $SPARK_HOME/lib/phoenix-4.7.0-HBase-1.1-client-spark.jar in
extraClassPath is what triggers the issue above.

I'll have a go at adding the individual dependencies instead of this fat
jar and see how it goes.

Thanks
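
A quick, hedged way to confirm which jar is actually supplying DFSClient at runtime (useful when a fat client jar such as the Phoenix one above shadows the cluster's hadoop-hdfs classes); run it from the same spark-shell or application that shows the error:

// prints the jar (or directory) the running JVM loaded the HDFS client class from;
// getCodeSource can be null for classes loaded from the bootstrap classpath
val src = classOf[org.apache.hadoop.hdfs.DFSClient].getProtectionDomain.getCodeSource
println(Option(src).map(_.getLocation.toString).getOrElse("bootstrap classpath"))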


On Thu, Apr 28, 2016 at 10:52 AM, pierre lacave  wrote:

> Thanks Ted,
>
> I am actually using the hadoop free version of spark
> (spark-1.5.0-bin-without-hadoop) over hadoop 2.6.1, so could very well be
> related indeed.
>
> I have configured spark-env.sh with export
> SPARK_DIST_CLASSPATH=$($HADOOP_PREFIX/bin/hadoop classpath), which is the
> only version of hadoop on the system (2.6.1) and able to interface with
> hdfs (on non-secured zones)
>
> Interestingly running this in the repl works fine
>
> // Create a simple DataFrame, stored into a partition directory
> val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
> df1.write.parquet("/securedzone/test")
>
>
> but if packaged as an app and ran in local or yarn client/cluster mode, it
> fails with the error described.
>
> I am not including anything Hadoop-specific, so not sure where the difference
> in DFSClient could come from.
>
> [info] Loading project definition from
> /Users/zoidberg/Documents/demo/x/trunk/src/jobs/project
> [info] Set current project to root (in build
> file:/Users/zoidberg/Documents/demo/x/trunk/src/jobs/)
> [info] Updating
> {file:/Users/zoidberg/Documents/demo/x/trunk/src/jobs/}common...
> [info] com.demo.project:root_2.10:0.2.3 [S]
> [info] com.demo.project:common_2.10:0.2.3 [S]
> [info]   +-joda-time:joda-time:2.8.2
> [info]
> [info] Resolving org.fusesource.jansi#jansi;1.4 ...
> [info] Done updating.
> [info] Updating
> {file:/Users/zoidberg/Documents/demo/x/trunk/src/jobs/}extract...
> [info] Resolving org.fusesource.jansi#jansi;1.4 ...
> [info] Done updating.
> [info] com.demo.project:extract_2.10:0.2.3 [S]
> [info]   +-com.demo.project:common_2.10:0.2.3 [S]
> [info]   | +-joda-time:joda-time:2.8.2
> [info]   |
> [info]   +-com.databricks:spark-csv_2.10:1.3.0 [S]
> [info] +-com.univocity:univocity-parsers:1.5.1
> [info] +-org.apache.commons:commons-csv:1.1
> [info]
> [success] Total time: 9 s, completed 28-Apr-2016 10:40:25
>
>
> I am assuming I do not need to rebuild Spark to use it with Hadoop 2.6.1
> and that the Spark build with user-provided Hadoop would let me do that.
>
>
> $HADOOP_PREFIX/bin/hadoop classpath expands to:
>
>
> /usr/local/project/hadoop/conf:/usr/local/project/hadoop/share/hadoop/common/lib/*:/usr/local/project/hadoop/share/hadoop/common/*:/usr/local/project/hadoop/share/hadoop/hdfs:/usr/local/project/hadoop/share/hadoop/hdfs/lib/*:/usr/local/project/hadoop/share/hadoop/hdfs/*:/usr/local/project/hadoop/share/hadoop/yarn/lib/*:/usr/local/project/hadoop/share/hadoop/yarn/*:/usr/local/project/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/project/hadoop/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar
>
> Thanks
>
>
> On Sun, Apr 24, 2016 at 2:20 AM, Ted Yu  wrote:
>
>> Can you check that the DFSClient Spark uses is the same version as on
>> the server side ?
>>
>> The client and server (NameNode) negotiate a "crypto protocol version" -
>> this is a forward-looking feature.
>> Please note:
>>
>> bq. Client provided: []
>>
>> Meaning client didn't provide any supported crypto protocol version.
>>
>> Cheers
>>
>> On Wed, Apr 20, 2016 at 3:27 AM, pierre lacave  wrote:
>>
>>> Hi
>>>
>>>
>>> I am trying to use spark to write to a protected zone in hdfs, I am able to 
>>> create and list file using the hdfs client but when writing via Spark I get 
>>> this exception.
>>>
>>> I could not find any mention of CryptoProtocolVersion in the spark doc.
>>>
>>>
>>> Any idea what could have gone wrong?
>>>
>>>
>>> spark (1.5.0), hadoop (2.6.1)
>>>
>>>
>>> Thanks
>>>
>>>
>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.UnknownCryptoProtocolVersionException):
>>>  No crypto protocol versions provided by the client are supported. Client 
>>> provided: [] NameNode supports: 
>>> [CryptoProtocolVersion{description='Unknown', version=1, 
>>> unknownValue=null}, CryptoProtocolVersion{description='Encryption zones', 
>>> version=2, unknownValue=null}]
>>> at 
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.chooseProtocolVersion(FSNamesystem.java:2468)
>>> at 
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2600)
>>> at 
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2520)
>>> at 
>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:579)
>>> at 
>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:394)
>>> at 
>>> 

Re: Reading from Amazon S3

2016-04-28 Thread Gourav Sengupta
Why would you use JAVA (create a problem and then try to solve it)? Have
you tried using Scala or Python or even R?

Regards,
Gourav

On Thu, Apr 28, 2016 at 10:07 AM, Steve Loughran 
wrote:

>
> On 26 Apr 2016, at 18:49, Ted Yu  wrote:
>
> Looking at the cause of the error, it seems hadoop-aws-xx.jar
> (corresponding to the version of hadoop you use) was missing in classpath.
>
>
> yes, s3n was moved from hadoop-common to the new hadoop-aws module, and
> without people realising it, that broke a lot of things.
>
> you'll need hadoop-aws and jets3t on the classpath
>
> If you are using Hadoop 2.7, I'd recommend s3a instead, which means
> hadoop-aws and the exact same amazon-sdk that comes bundled with the hadoop
> binaries your version of spark is built with (it's a bit brittle API-wise)
>
>
> FYI
>
> On Tue, Apr 26, 2016 at 9:06 AM, Jinan Alhajjaj 
> wrote:
>
>> Hi All,
>> I am trying to read a file stored in Amazon S3.
>> I wrote this code:
>>
>> import java.util.List;
>>
>> import java.util.Scanner;
>>
>> import org.apache.spark.SparkConf;
>>
>> import org.apache.spark.api.java.JavaRDD;
>>
>> import org.apache.spark.api.java.JavaSparkContext;
>>
>> import org.apache.spark.api.java.function.Function;
>>
>> import org.apache.spark.sql.DataFrame;
>>
>> import org.apache.spark.sql.Row;
>>
>> import org.apache.spark.sql.SQLContext;
>>
>> public class WordAnalysis {
>>
>> public static void main(String[] args) {
>>
>> int startYear=0;
>>
>> int endyear=0;
>>
>> Scanner input = new Scanner(System.in);
>>
>> System.out.println("Please, Enter 1 if you want 1599-2008 or enter 2
>> for specific range: ");
>>
>> int choice=input.nextInt();
>>
>>
>> if(choice==1)
>>
>> {
>>
>> startYear=1500;
>>
>> endyear=2008;
>>
>> }
>>
>> if(choice==2)
>>
>> {
>>
>> System.out.print("please,Enter the start year : ");
>>
>> startYear = input.nextInt();
>>
>> System.out.print("please,Enter the end year : ");
>>
>> endyear = input.nextInt();
>>
>> }
>>
>> SparkConf conf = new SparkConf().setAppName("jinantry").setMaster("local"
>> );
>>
>> JavaSparkContext spark = new JavaSparkContext(conf);
>>
>> SQLContext sqlContext = new org.apache.spark.sql.SQLContext(spark);
>>
>> JavaRDD<Items> ngram = spark.textFile("
>> s3n://google-books-ngram/1gram/googlebooks-eng-all-1gram-20120701-x.gz")
>>
>> .map(new Function<String, Items>() {
>>
>> public Items call(String line) throws Exception {
>>
>> String[] parts = line.split("\t");
>>
>> Items item = new Items();
>>
>> if (parts.length == 4) {
>>
>> item.setWord(parts[0]);
>>
>> item.setYear(Integer.parseInt(parts[1]));
>>
>> item.setCount(Integer.parseInt(parts[2]));
>>
>> item.setVolume(Integer.parseInt(parts[3]));
>>
>> return item;
>>
>> } else {
>>
>> item.setWord(" ");
>>
>> item.setYear(0); // defaults for malformed lines; Integer.parseInt(" ") would throw
>>
>> item.setCount(0);
>>
>> item.setVolume(0);
>>
>> return item;
>>
>> }
>>
>> }
>>
>> });
>>
>> DataFrame schemangram = sqlContext.createDataFrame(ngram, Items.class);
>>
>> schemangram.registerTempTable("ngram");
>>
>> String sql="SELECT word,SUM(count) FROM ngram where year >="+startYear+"
>> AND year<="+endyear+" And word LIKE '%_NOUN' GROUP BY word ORDER BY
>> SUM(count) DESC";
>>
>> DataFrame matchyear = sqlContext.sql(sql);
>>
>> List<Row> words = matchyear.collectAsList();
>>
>> int i=1;
>>
>> for (Row scholar : words) {
>>
>> System.out.println(scholar);
>>
>> if(i==10)
>>
>> break;
>>
>> i++;
>>
>>   }
>>
>>
>> }
>>
>>
>> }
>>
>>
>> When I run it this error appear to me:
>>
>> Exception in thread "main"
>> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute,
>> tree:
>>
>> Exchange rangepartitioning(aggOrder#5L DESC,200), None
>>
>> +- ConvertToSafe
>>
>>+- TungstenAggregate(key=[word#2], functions=[(sum(cast(count#0 as
>> bigint)),mode=Final,isDistinct=false)], output=[word#2,_c1#4L,aggOrder#5L])
>>
>>   +- TungstenExchange hashpartitioning(word#2,200), None
>>
>>  +- TungstenAggregate(key=[word#2], functions=[(sum(cast(count#0
>> as bigint)),mode=Partial,isDistinct=false)], output=[word#2,sum#8L])
>>
>> +- Project [word#2,count#0]
>>
>>+- Filter (((year#3 >= 1500) && (year#3 <= 1600)) &&
>> word#2 LIKE %_NOUN)
>>
>>   +- Scan ExistingRDD[count#0,volume#1,word#2,year#3]
>>
>>
>> at
>> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
>>
>> at org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:247)
>>
>> at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>>
>> at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>>
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>>
>> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>>
>> at
>> 
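
A minimal sketch of the s3a route suggested earlier in this thread, assuming Spark is running against Hadoop 2.7+ binaries with hadoop-aws and the matching aws-java-sdk jar on the classpath (the bucket path is taken from the original post; credential handling here is only illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("s3a-read").setMaster("local[*]")
val sc = new SparkContext(conf)

// s3a credentials; in production prefer IAM roles or core-site.xml over hard-coding
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))

// s3a:// rather than s3n:// on Hadoop 2.7+
val lines = sc.textFile("s3a://google-books-ngram/1gram/googlebooks-eng-all-1gram-20120701-x.gz")
println(lines.count())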

Re: spark sql job create too many files in HDFS when doing insert overwrite hive table

2016-04-28 Thread linxi zeng
BTW, I have created a JIRA task to follow this issue:
https://issues.apache.org/jira/browse/SPARK-14974

2016-04-28 18:08 GMT+08:00 linxi zeng :

> Hi,
>
> Recently, we often encounter problems using spark sql for inserting data
> into a partition table (ex.: insert overwrite table $output_table
> partition(dt) select xxx from tmp_table).
> After the Spark job starts running on YARN, *the app will create too many
> files (e.g. 2,000,000+ or even 10,000,000+), which will put HDFS under enormous
> pressure*.
> We found that the num of files created by spark job is depending on the
> partition num of hive table that will be inserted and the num of spark sql
> partitions.
> *files_num = hive_table_partions_num * spark_sql_partitions_num*.
> We often make the spark_sql_partitions_num(spark.sql.shuffle.partitions)
> >= 1000, and the hive_table_partions_num is very small under normal
> circumstances, but it can turn out to be more than 2000 when we accidentally use a
> wrong field as the partition field, which will make the
> files_num >= 1000 * 2000 = 2,000,000.
>
> There is a configuration parameter in hive that can limit the maximum
> number of dynamic partitions allowed to be created in each mapper/reducer
> named *hive.exec.max.dynamic.partitions.pernode*, but this parameter
> didn't work when we use hiveContext.
>
> Reducing spark_sql_partitions_num(spark.sql.shuffle.partitions) can make
> the files_num be smaller, but it will affect the concurrency.
>
> Can we add configuration parameters to limit the maximum number of
> files each task is allowed to create, or limit the
> spark_sql_partitions_num without affecting the concurrency?
>


Re: Save DataFrame to HBase

2016-04-28 Thread Ted Yu
The HBase 2.0 release would likely come after the Spark 2.0 release.

There are other features being developed in HBase 2.0, so
I am not sure when HBase 2.0 would be released.

The refguide is incomplete.
Zhan has assigned the doc JIRA to himself. The documentation would be done 
after the bugs in the hbase-spark module are fixed.

Cheers

> On Apr 27, 2016, at 10:31 PM, Benjamin Kim  wrote:
> 
> Hi Ted,
> 
> Do you know when the release will be? I also see some documentation for usage 
> of the hbase-spark module at the hbase website. But, I don’t see an example 
> on how to save data. There is only one for reading/querying data. Will this 
> be added when the final version does get released?
> 
> Thanks,
> Ben
> 
>> On Apr 21, 2016, at 6:56 AM, Ted Yu  wrote:
>> 
>> The hbase-spark module in Apache HBase (coming with hbase 2.0 release) can 
>> do this.
>> 
>>> On Thu, Apr 21, 2016 at 6:52 AM, Benjamin Kim  wrote:
>>> Has anyone found an easy way to save a DataFrame into HBase?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
> 
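
Until the hbase-spark module ships, one hedged fallback (a sketch, not the module's API) is to convert the DataFrame's rows to Puts and write them through the standard TableOutputFormat; the table name, the column family "cf" and the column names below are assumptions, and hbase-site.xml is assumed to be on the classpath:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

// assumes an existing DataFrame `df` with string columns "rowkey" and "value"
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")
val job = Job.getInstance(hbaseConf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

// map each row to (key, Put) and write via the Hadoop output format
val puts = df.rdd.map { row =>
  val put = new Put(Bytes.toBytes(row.getAs[String]("rowkey")))
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"),
    Bytes.toBytes(row.getAs[String]("value")))
  (new ImmutableBytesWritable, put)
}
puts.saveAsNewAPIHadoopDataset(job.getConfiguration)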


spark sql job create too many files in HDFS when doing insert overwrite hive table

2016-04-28 Thread linxi zeng
Hi,

Recently, we often encounter problems using spark sql for inserting data
into a partition table (ex.: insert overwrite table $output_table
partition(dt) select xxx from tmp_table).
After the Spark job starts running on YARN, *the app will create too many
files (e.g. 2,000,000+ or even 10,000,000+), which will put HDFS under enormous
pressure*.
We found that the number of files created by the Spark job depends on the
partition count of the Hive table being inserted into and the number of Spark SQL
partitions:
*files_num = hive_table_partitions_num * spark_sql_partitions_num*.
We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >=
1000, and hive_table_partitions_num is very small under normal
circumstances, but it can turn out to be more than 2000 when we accidentally use a
wrong field as the partition field, which makes
files_num >= 1000 * 2000 = 2,000,000.

There is a configuration parameter in Hive that can limit the maximum
number of dynamic partitions allowed to be created in each mapper/reducer,
named *hive.exec.max.dynamic.partitions.pernode*, but this parameter
didn't work when we use hiveContext.

Reducing spark_sql_partitions_num (spark.sql.shuffle.partitions) can make
files_num smaller, but it will hurt the concurrency.

Can we add configuration parameters to limit the maximum number of
files each task is allowed to create, or limit the
spark_sql_partitions_num without affecting the concurrency?
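
One hedged workaround (a sketch, not something confirmed in this thread): repartition the DataFrame by the dynamic-partition column before the insert, so each Hive partition is written by only a few tasks instead of by every shuffle partition. Table and column names below are the placeholders used above, and the partition column is assumed to be the last column selected:

// assumes an existing SparkContext `sc` and Spark 1.6+ for repartition(Column*)
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

val tmp = hiveContext.table("tmp_table")

// cluster rows by the partition value so a given dt is written by few tasks,
// keeping files_num close to hive_table_partitions_num instead of the product
tmp.repartition(tmp("dt")).registerTempTable("tmp_table_clustered")

hiveContext.sql(
  """INSERT OVERWRITE TABLE output_table PARTITION (dt)
    |SELECT * FROM tmp_table_clustered""".stripMargin)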


Re: Spark writing to secure zone throws : UnknownCryptoProtocolVersionException

2016-04-28 Thread pierre lacave
Thanks Ted,

I am actually using the hadoop free version of spark
(spark-1.5.0-bin-without-hadoop) over hadoop 2.6.1, so could very well be
related indeed.

I have configured spark-env.sh with export
SPARK_DIST_CLASSPATH=$($HADOOP_PREFIX/bin/hadoop classpath), which is the
only version of hadoop on the system (2.6.1) and able to interface with
hdfs (on non-secured zones)

Interestingly running this in the repl works fine

// Create a simple DataFrame, stored into a partition directory
val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.write.parquet("/securedzone/test")


but if packaged as an app and ran in local or yarn client/cluster mode, it
fails with the error described.

I am not including anything Hadoop-specific, so not sure where the difference in
DFSClient could come from.

[info] Loading project definition from
/Users/zoidberg/Documents/demo/x/trunk/src/jobs/project
[info] Set current project to root (in build
file:/Users/zoidberg/Documents/demo/x/trunk/src/jobs/)
[info] Updating
{file:/Users/zoidberg/Documents/demo/x/trunk/src/jobs/}common...
[info] com.demo.project:root_2.10:0.2.3 [S]
[info] com.demo.project:common_2.10:0.2.3 [S]
[info]   +-joda-time:joda-time:2.8.2
[info]
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Updating
{file:/Users/zoidberg/Documents/demo/x/trunk/src/jobs/}extract...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] com.demo.project:extract_2.10:0.2.3 [S]
[info]   +-com.demo.project:common_2.10:0.2.3 [S]
[info]   | +-joda-time:joda-time:2.8.2
[info]   |
[info]   +-com.databricks:spark-csv_2.10:1.3.0 [S]
[info] +-com.univocity:univocity-parsers:1.5.1
[info] +-org.apache.commons:commons-csv:1.1
[info]
[success] Total time: 9 s, completed 28-Apr-2016 10:40:25


I am assuming I do not need to rebuild Spark to use it with Hadoop 2.6.1
and that the Spark build with user-provided Hadoop would let me do that.


$HADOOP_PREFIX/bin/hadoop classpath expands to:

/usr/local/project/hadoop/conf:/usr/local/project/hadoop/share/hadoop/common/lib/*:/usr/local/project/hadoop/share/hadoop/common/*:/usr/local/project/hadoop/share/hadoop/hdfs:/usr/local/project/hadoop/share/hadoop/hdfs/lib/*:/usr/local/project/hadoop/share/hadoop/hdfs/*:/usr/local/project/hadoop/share/hadoop/yarn/lib/*:/usr/local/project/hadoop/share/hadoop/yarn/*:/usr/local/project/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/project/hadoop/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar

Thanks


On Sun, Apr 24, 2016 at 2:20 AM, Ted Yu  wrote:

> Can you check that the DFSClient Spark uses is the same version as on the
> server side ?
>
> The client and server (NameNode) negotiate a "crypto protocol version" -
> this is a forward-looking feature.
> Please note:
>
> bq. Client provided: []
>
> Meaning client didn't provide any supported crypto protocol version.
>
> Cheers
>
> On Wed, Apr 20, 2016 at 3:27 AM, pierre lacave  wrote:
>
>> Hi
>>
>>
>> I am trying to use spark to write to a protected zone in hdfs, I am able to 
>> create and list file using the hdfs client but when writing via Spark I get 
>> this exception.
>>
>> I could not find any mention of CryptoProtocolVersion in the spark doc.
>>
>>
>> Any idea what could have gone wrong?
>>
>>
>> spark (1.5.0), hadoop (2.6.1)
>>
>>
>> Thanks
>>
>>
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.UnknownCryptoProtocolVersionException):
>>  No crypto protocol versions provided by the client are supported. Client 
>> provided: [] NameNode supports: 
>> [CryptoProtocolVersion{description='Unknown', version=1, unknownValue=null}, 
>> CryptoProtocolVersion{description='Encryption zones', version=2, 
>> unknownValue=null}]
>>  at 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.chooseProtocolVersion(FSNamesystem.java:2468)
>>  at 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2600)
>>  at 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2520)
>>  at 
>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:579)
>>  at 
>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:394)
>>  at 
>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>  at 
>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>>  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)
>>  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2036)
>>  at java.security.AccessController.doPrivileged(Native Method)
>>  at javax.security.auth.Subject.doAs(Subject.java:415)
>>  at 
>> 

Classification or grouping of analyzing tools

2016-04-28 Thread Esa Heikkinen


Hi

I am a newbie in this analysis field. It seems there exist many 
tools, frameworks, ecosystems, software packages, languages and so on.


1) Do there exist some classifications or groupings for them?
2) What kinds of tools exist?
3) What are the main purposes of these tools?

Regards
Esa Heikkinen

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark support for Complex Event Processing (CEP)

2016-04-28 Thread Esa Heikkinen


Do you know any good examples of how to use Spark Streaming for tracking 
public transportation systems?


Or an example with Storm or some other tool?

Regards
Esa Heikkinen

28.4.2016, 3:16, Michael Segel kirjoitti:

Uhm…
I think you need to clarify a couple of things…

First there is this thing called analog signal processing…. Is that 
continuous enough for you?


But more to the point, Spark Streaming does micro batching so if 
you’re processing a continuous stream of tick data, you will have more 
than 50K of tics per second while there are markets open and trading. 
 Even at 50K a second, that would mean 1 every .02 ms or 50 ticks a ms.


And you don’t want to wait until you have a batch to start processing, 
but you want to process when the data hits the queue and pull it from 
the queue as quickly as possible.


Spark streaming will be able to pull batches in as little as 500ms. So 
if you pull a batch at t0 and immediately have a tick in your queue, 
you won’t process that data until t0+500ms. And said batch would 
contain 25,000 entries.


Depending on what you are doing… that 500ms delay can be enough to be 
fatal to your trading process.


If you don’t like stock data, there are other examples mainly when 
pulling data from real time embedded systems.



If you go back and read what I said, if your data flow is >> (much 
slower) than 500ms, and / or the time to process is >> 500ms ( much 
longer )  you could use spark streaming.  If not… and there are 
applications which require that type of speed…  then you shouldn’t use 
spark streaming.


If you do have that constraint, then you can look at systems like 
storm/flink/samza / whatever where you have a continuous queue and 
listener and no micro batch delays.
Then for each bolt (storm) you can have a spark context for processing 
the data. (Depending on what sort of processing you want to do.)


To put this in perspective… if you’re using spark streaming / akka / 
storm /etc to handle real time requests from the web, 500ms added 
delay can be a long time.


Choose the right tool.

For the OP’s problem. Sure Tracking public transportation could be 
done using spark streaming. It could also be done using half a dozen 
other tools because the rate of data generation is much slower than 
500ms.


HTH


On Apr 27, 2016, at 4:34 PM, Mich Talebzadeh 
> wrote:


couple of things.

There is no such thing as Continuous Data Streaming as there is no 
such thing as Continuous Availability.


There is such thing as Discrete Data Streaming and  High 
Availability  but they reduce the finite unavailability to minimum. 
In terms of business needs a 5 SIGMA is good enough and acceptable. 
Even the candles set to a predefined time interval say 2, 4, 15 
seconds overlap. No FX savvy trader makes a sell or buy decision on 
the basis of 2 seconds candlestick


The calculation itself in measurements is subject to finite error as 
defined by their Confidence Level (CL) using Standard Deviation 
function.


OK, so far I have never noticed a tool that requires that level of 
granularity. That stuff from Flink etc. is, in practical terms, of 
little value and does not make commercial sense.


Now with regard to your needs, Spark micro batching is perfectly 
adequate.


HTH

Dr Mich Talebzadeh

LinkedIn 
/https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw/


http://talebzadehmich.wordpress.com 




On 27 April 2016 at 22:10, Esa Heikkinen 
> 
wrote:



Hi

Thanks for the answer.

I have developed a log file analyzer for an RTPIS (Real Time
Passenger Information System), where buses drive routes and
the system tries to estimate the arrival times at the bus stops.
There are many different log files (and events) and the analysis
situation can be very complex. Spatial data can also be included
in the log data.

The analyzer also has a query (or analysis) language, which
describes an expected behavior. This can be a requirement of the
system. The analyzer can also be thought of as a test oracle.

I have published a paper (SPLST'15 conference) about my analyzer
and its language. The paper may be too technical, but it can be found here:
http://ceur-ws.org/Vol-1525/paper-19.pdf

I do not know yet where it belongs. I think it can be some "CEP
with delays". Or do you know better?
My analyzer can also do somewhat more complex and
time-consuming analyses; there is no need for real time.

And is it possible to do "CEP with delays" reasonably with some
existing analyzer (for example Spark)?

Regards
PhD student at Tampere University of Technology, Finland,
www.tut.fi 
Esa Heikkinen

27.4.2016, 15:51, Michael Segel kirjoitti:

Spark and CEP? It depends…

Ok, I know that’s not the answer 

Fwd: AuthorizationException while exposing via JDBC client (beeline)

2016-04-28 Thread ram kumar
Hi,

I wrote a spark job which registers a temp table
and when I expose it via beeline (JDBC client)

$ ./bin/beeline
beeline> !connect jdbc:hive2://IP:10003 -n ram -p
0: jdbc:hive2://IP> show tables;
+------------+--------------+
| tableName  | isTemporary  |
+------------+--------------+
| f238       | true         |
+------------+--------------+
2 rows selected (0.309 seconds)
0: jdbc:hive2://IP>

I can view the table. When querying I get this error message

0: jdbc:hive2://IP> select * from f238;
*Error:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException):
User: ram is not allowed to impersonate ram (state=,code=0)*
0: jdbc:hive2://IP>

I have this in hive-site.xml,


<property>
  <name>hive.metastore.sasl.enabled</name>
  <value>false</value>
  <description>If true, the metastore Thrift interface will be secured
    with SASL. Clients must authenticate with Kerberos.</description>
</property>

<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>

<property>
  <name>hive.server2.authentication</name>
  <value>NONE</value>
</property>

I have this in core-site.xml,

<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>*</value>
</property>

<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>*</value>
</property>



When persisting as a table using saveAsTable, I am able to query it via
beeline.
Any idea what configuration I am missing?

Thanks


Re: DataFrame job fails on parsing error, help?

2016-04-28 Thread Night Wolf
We are hitting the same issue on Spark 1.6.1 with tungsten enabled, kryo
enabled & sort based shuffle.

Did you find a resolution?
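
One commonly tried mitigation (an assumption on my part, not something established in this thread) is to rule out a snappy version clash by switching the shuffle/compression codec, e.g.:

import org.apache.spark.{SparkConf, SparkContext}

// lz4 (or lzf) avoids the snappy native library entirely, which helps isolate
// whether PARSING_ERROR(2) comes from mismatched snappy jars on the executors
val conf = new SparkConf()
  .setAppName("large-join")
  .set("spark.io.compression.codec", "lz4")
val sc = new SparkContext(conf)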

On Sat, Apr 9, 2016 at 6:31 AM, Ted Yu  wrote:

> Not much.
>
> So no chance of different snappy version ?
>
> On Fri, Apr 8, 2016 at 1:26 PM, Nicolas Tilmans 
> wrote:
>
>> Hi Ted,
>>
>> The Spark version is 1.6.1, a nearly identical set of operations does
>> fine on smaller datasets. It's just a few joins then a groupBy and a count
>> in pyspark.sql on a Spark DataFrame.
>>
>> Any ideas?
>>
>> Nicolas
>>
>> On Fri, Apr 8, 2016 at 1:13 PM, Ted Yu  wrote:
>>
>>> Did you encounter similar error on a smaller dataset ?
>>>
>>> Which release of Spark are you using ?
>>>
>>> Is it possible you have an incompatible snappy version somewhere in your
>>> classpath ?
>>>
>>> Thanks
>>>
>>> On Fri, Apr 8, 2016 at 12:36 PM, entee  wrote:
>>>
 I'm trying to do a relatively large join (0.5TB shuffle read/write) and
 just
 calling a count (or show) on the dataframe fails to complete, getting
 to the
 last task before failing:

 Py4JJavaError: An error occurred while calling o133.count.
 : org.apache.spark.SparkException: Job aborted due to stage failure:
 Task 5
 in stage 11.0 failed 4 times, most recent failure: Lost task 5.3 in
 stage
 11.0 (TID 7836, .com): java.io.IOException: failed to uncompress
 the
 chunk: PARSING_ERROR(2)

 (full stack trace below)

 I'm not sure why this would happen, any ideas about whether our
 configuration is off or how to fix this?

 Nicolas



 Py4JJavaError: An error occurred while calling o133.count.
 : org.apache.spark.SparkException: Job aborted due to stage failure:
 Task 5
 in stage 11.0 failed 4 times, most recent failure: Lost task 5.3 in
 stage
 11.0 (TID 7836, .com): java.io.IOException: failed to uncompress
 the
 chunk: PARSING_ERROR(2)
 at

 org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:361)
 at
 org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:158)
 at
 org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
 at
 java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
 at
 java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
 at
 java.io.BufferedInputStream.read(BufferedInputStream.java:345)
 at java.io.DataInputStream.read(DataInputStream.java:149)
 at
 org.spark-project.guava.io.ByteStreams.read(ByteStreams.java:899)
 at
 org.spark-project.guava.io.ByteStreams.readFully(ByteStreams.java:733)
 at

 org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:119)
 at

 org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:102)
 at scala.collection.Iterator$$anon$12.next(Iterator.scala:397)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
 at

 org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
 at

 org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
 at

 org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
 at
 org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
 at
 org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
 at

 org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
 at

 org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
 at

 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
 at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
 at

 org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:96)
 at

 

Re: EOFException while reading from HDFS

2016-04-28 Thread Saurav Sinha
Are you able to connect to the NameNode UI at MACHINE_IP:50070?

Check what the URI is there.

If the UI doesn't open, it means your HDFS is not up; try to start it using
start-dfs.sh.
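
For the question quoted below, a small Scala sketch (pyspark is analogous) that reads fs.defaultFS from the client's own Hadoop configuration instead of hard-coding host:port, so the URI always matches what the NameNode actually binds; `sc` and `sqlContext` are assumed from a running shell and the path is the one from the thread:

// print the filesystem URI the client is configured with, e.g. hdfs://172.26.49.156:54310
val defaultFs = sc.hadoopConfiguration.get("fs.defaultFS")
println(defaultFs)

// build the load path from that URI rather than guessing 54310 vs 8020
val path = defaultFs + "/bibudh/healthcare/data/cloudera_challenge/patients.csv"
val patients = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .load(path)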



On Thu, Apr 28, 2016 at 2:59 AM, Bibudh Lahiri 
wrote:

> Hi,
>   I installed Hadoop 2.6.0 today on one of the machines (172.26.49.156),
> got HDFS running on it (both Namenode and Datanode on the same machine) and
> copied the files to HDFS. However, from the same machine, when I try to
> load the same CSV with the following statement:
>
>   sqlContext.read.format("com.databricks.spark.csv").option("header",
> "false").load("hdfs://
> 172.26.49.156:54310/bibudh/healthcare/data/cloudera_challenge/patients.csv
> ")
>
>  I get the error
>
> java.net.ConnectException: Call From impetus-i0276.impetus.co.in/127.0.0.1
> to impetus-i0276:54310 failed on connection exception:
> java.net.ConnectException: Connection refused; For more details see:
> http://wiki.apache.org/hadoop/ConnectionRefused
>
>   I have changed the port number to 8020 but the same error gets reported.
>
>   Even the following command is not working from the command line, when
> launched from the HADOOP_HOME folder for
>
>   bin/hdfs dfs -ls hdfs://172.26.49.156:54310/
>
>   which was working earlier when issued from the other machine
> (172.26.49.55), from under HADOOP_HOME for Hadoop 1.0.4.
>
>   I set ~/.bashrc as follows when I installed Hadoop 2.6.0:
>
>   export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64
> export HADOOP_HOME=/usr/local/hadoop-2.6.0
> export HADOOP_INSTALL=$HADOOP_HOME
> export HADOOP_PREFIX=$HADOOP_HOME
> export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
> export HADOOP_COMMON_HOME=$HADOOP_PREFIX
> export HADOOP_HDFS_HOME=$HADOOP_PREFIX
> export YARN_HOME=$HADOOP_PREFIX
> export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
> export SPARK_HOME=/home/impadmin/spark-1.6.0-bin-hadoop2.6
>
> PATH=$PATH:$JAVA_HOME/bin:$HADOOP_PREFIX/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin
> export HADOOP_CONF_DIR=$HADOOP_HOME
> export HADOOP_LIBEXEC_DIR=$HADOOP_HOME/libexec
> export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
> export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
>
>   Am I getting the port number wrong, or is it some other config param
> that I should check? What's the general rule here?
>
> Thanks
>   Bibudh
>
> On Tue, Apr 26, 2016 at 7:51 PM, Davies Liu  wrote:
>
>> The Spark package you are using is packaged with Hadoop 2.6, but the
>> HDFS is Hadoop 1.0.4, they are not compatible.
>>
>> On Tue, Apr 26, 2016 at 11:18 AM, Bibudh Lahiri 
>> wrote:
>> > Hi,
>> >   I am trying to load a CSV file which is on HDFS. I have two machines:
>> > IMPETUS-1466 (172.26.49.156) and IMPETUS-1325 (172.26.49.55). Both have
>> > Spark 1.6.0 pre-built for Hadoop 2.6 and later, but for both, I had
>> existing
>> > Hadoop clusters running Hadoop 1.0.4. I have launched HDFS from
>> > 172.26.49.156 by running start-dfs.sh from it, copied files from local
>> file
>> > system to HDFS and can view them by hadoop fs -ls.
>> >
>> >   However, when I am trying to load the CSV file from pyspark shell
>> > (launched by bin/pyspark --packages com.databricks:spark-csv_2.10:1.3.0)
>> > from IMPETUS-1325 (172.26.49.55) with the following commands:
>> >
>> >
>> >>>from pyspark.sql import SQLContext
>> >
>> >>>sqlContext = SQLContext(sc)
>> >
>> >>>patients_df =
>> >>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>> >>> "false").load("hdfs://
>> 172.26.49.156:54310/bibudh/healthcare/data/cloudera_challenge/patients.csv
>> ")
>> >
>> >
>> > I get the following error:
>> >
>> >
>> > java.io.EOFException: End of File Exception between local host is:
>> > "IMPETUS-1325.IMPETUS.CO.IN/172.26.49.55"; destination host is:
>> > "IMPETUS-1466":54310; : java.io.EOFException; For more details see:
>> > http://wiki.apache.org/hadoop/EOFException
>> >
>> >
>> > I have changed the port number from 54310 to 8020, but then I get the
>> error
>> >
>> >
>> > java.net.ConnectException: Call From
>> IMPETUS-1325.IMPETUS.CO.IN/172.26.49.55
>> > to IMPETUS-1466:8020 failed on connection exception:
>> > java.net.ConnectException: Connection refused; For more details see:
>> > http://wiki.apache.org/hadoop/ConnectionRefused
>> >
>> >
>> > To me it seemed like this may result from a version mismatch between
>> Spark
>> > Hadoop client and my Hadoop cluster, so I have made the following
>> changes:
>> >
>> >
>> > 1) Added the following lines to conf/spark-env.sh
>> >
>> >
>> > export HADOOP_HOME="/usr/local/hadoop-1.0.4" export
>> > HADOOP_CONF_DIR="$HADOOP_HOME/conf" export
>> > HDFS_URL="hdfs://172.26.49.156:8020"
>> >
>> >
>> > 2) Downloaded Spark 1.6.0, pre-built with user-provided Hadoop, and in
>> > addition to the three lines above, added the following line to
>> > conf/spark-env.sh
>> >
>> >
>> > export SPARK_DIST_CLASSPATH="/usr/local/hadoop-1.0.4/bin/hadoop"
>> >
>> >
>> > but none of