Re: Graphx seems to be broken while Creating a large graph(6B nodes in my case)

2014-08-22 Thread Jeffrey Picard
I’m seeing this issue also. I have a graph with 5828339535 vertices and 
7398447992 edges; graph.numVertices returns 1533266498, while graph.numEdges is 
correct and returns 7398447992. I’m also having an issue that I’m beginning to 
suspect is caused by the same underlying problem, where connected components 
stops after one iteration and returns an incorrect graph.
On Aug 22, 2014, at 8:43 PM, npanj  wrote:

> While creating a graph with 6B nodes and 12B edges, I noticed that
> *'numVertices' api returns an incorrect result*; 'numEdges' reports the
> correct number. A few times (with different datasets > 2.5B nodes) I have
> also noticed that numVertices is returned as a negative number, so I
> suspect there is some overflow (maybe we are using Int for some field?).
> 
> Environment: Standalone mode running on EC2 . Using latest code from master
> branch upto commit #db56f2df1b8027171da1b8d2571d1f2ef1e103b6 .
> 
> Here are some details of the experiments I have done so far: 
> 1. Input: numNodes=6101995593 ; noEdges=12163784626
> Graph returns: numVertices=1807028297 ; numEdges=12163784626
> 2. Input : numNodes=*2157586441* ; noEdges=2747322705
> Graph Returns: numVertices=*-2137380855* ; numEdges=2747322705
> 3. Input: numNodes=1725060105 ; noEdges=204176821
> Graph: numVertices=1725060105 ; numEdges=2041768213 
> 
> 
> You can find the code to generate this bug here:
> https://gist.github.com/npanj/92e949d86d08715bf4bf
> 
> (I have also filed this jira ticket:
> https://issues.apache.org/jira/browse/SPARK-3190)
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Graphx-seems-to-be-broken-while-Creating-a-large-graph-6B-nodes-in-my-case-tp7966.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 





Re: Spark Contribution

2014-08-22 Thread Maisnam Ns
Thanks, all, for adding this link.


On Sat, Aug 23, 2014 at 5:38 AM, Reynold Xin  wrote:

> Great idea. Added the link
> https://github.com/apache/spark/blob/master/README.md
>
>
>
> On Thu, Aug 21, 2014 at 4:06 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> We should add this link to the readme on GitHub btw.
>>
>> 2014년 8월 21일 목요일, Henry Saputra님이 작성한 메시지:
>>
>> > The Apache Spark wiki on how to contribute should be a great place to
>> > start:
>> > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
>> >
>> > - Henry
>> >
>> > On Thu, Aug 21, 2014 at 3:25 AM, Maisnam Ns > > > wrote:
>> > > Hi,
>> > >
>> > > Can someone help me with some links on how to contribute to Spark?
>> > >
>> > > Regards
>> > > mns
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> 
>> >
>> >
>>
>
>


Re: reference to dstream in package org.apache.spark.streaming which is not available

2014-08-22 Thread Tathagata Das
The real fix is that the spark sink suite does not really need to use the
spark-streaming test jars. I am removing that dependency altogether and
submitting a PR.

TD


On Fri, Aug 22, 2014 at 6:34 PM, Tathagata Das 
wrote:

> Figured it out. Fixing this ASAP.
>
> TD
>
>
> On Fri, Aug 22, 2014 at 5:51 PM, Patrick Wendell 
> wrote:
>
>> Hey All,
>>
>> We can sort this out ASAP. Many of the Spark committers were at a company
>> offsite for the last 72 hours, so sorry that it is broken.
>>
>> - Patrick
>>
>>
>> On Fri, Aug 22, 2014 at 4:07 PM, Hari Shreedharan <
>> hshreedha...@cloudera.com
>> > wrote:
>>
>> > Sean - I think only the ones in 1726 are enough. It is weird that any
>> > class that uses the test-jar actually requires the streaming jar to be
>> > added explicitly. Shouldn't maven take care of this?
>> >
>> > I posted some comments on the PR.
>> >
>> > --
>> >
>> > Thanks,
>> > Hari
>> >
>> >
>> >  Sean Owen 
>> >> August 22, 2014 at 3:58 PM
>> >>
>> >> Yes, master hasn't compiled for me for a few days. It's fixed in:
>> >>
>> >> https://github.com/apache/spark/pull/1726
>> >> https://github.com/apache/spark/pull/2075
>> >>
>> >> Could a committer sort this out?
>> >>
>> >> Sean
>> >>
>> >>
>> >> -
>> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: dev-h...@spark.apache.org
>> >>
>> >> Ted Yu 
>> >> August 22, 2014 at 1:55 PM
>> >>
>> >> Hi,
>> >> Using the following command on (refreshed) master branch:
>> >> mvn clean package -DskipTests
>> >>
>> >> I got:
>> >>
>> >> constituent[36]: file:/homes/hortonzy/apache-maven-3.1.1/conf/logging/
>> >> ---
>> >> java.lang.reflect.InvocationTargetException
>> >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >> at
>> >> sun.reflect.NativeMethodAccessorImpl.invoke(
>> >> NativeMethodAccessorImpl.java:57)
>> >> at
>> >> sun.reflect.DelegatingMethodAccessorImpl.invoke(
>> >> DelegatingMethodAccessorImpl.java:43)
>> >> at java.lang.reflect.Method.invoke(Method.java:606)
>> >> at
>> >> org.codehaus.plexus.classworlds.launcher.Launcher.
>> >> launchEnhanced(Launcher.java:289)
>> >> at
>> >> org.codehaus.plexus.classworlds.launcher.Launcher.
>> >> launch(Launcher.java:229)
>> >> at
>> >> org.codehaus.plexus.classworlds.launcher.Launcher.
>> >> mainWithExitCode(Launcher.java:415)
>> >> at org.codehaus.plexus.classworlds.launcher.Launcher.
>> >> main(Launcher.java:356)
>> >> Caused by: scala.reflect.internal.Types$TypeError: bad symbolic
>> >> reference.
>> >> A signature in TestSuiteBase.class refers to term dstream
>> >> in package org.apache.spark.streaming which is not available.
>> >> It may be completely missing from the current classpath, or the
>> version on
>> >> the classpath might be incompatible with the version used when
>> compiling
>> >> TestSuiteBase.class.
>> >> at
>> >> scala.reflect.internal.pickling.UnPickler$Scan.
>> >> toTypeError(UnPickler.scala:847)
>> >> at
>> >> scala.reflect.internal.pickling.UnPickler$Scan$LazyTypeRef.complete(
>> >> UnPickler.scala:854)
>> >> at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1231)
>> >> at
>> >>
>> scala.reflect.internal.Types$TypeMap$$anonfun$noChangeToSymbols$1.apply(
>> >> Types.scala:4280)
>> >> at
>> >>
>> scala.reflect.internal.Types$TypeMap$$anonfun$noChangeToSymbols$1.apply(
>> >> Types.scala:4280)
>> >> at
>> >> scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.
>> >> scala:70)
>> >> at scala.collection.immutable.List.forall(List.scala:84)
>> >> at scala.reflect.internal.Types$TypeMap.noChangeToSymbols(
>> >> Types.scala:4280)
>> >> at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4293)
>> >> at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4196)
>> >> at scala.reflect.internal.Types$AsSeenFromMap.apply(Types.scala:4638)
>> >> at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4202)
>> >> at scala.reflect.internal.Types$AsSeenFromMap.apply(Types.scala:4638)
>> >> at scala.reflect.internal.Types$Type.asSeenFrom(Types.scala:754)
>> >> at scala.reflect.internal.Types$Type.memberInfo(Types.scala:773)
>> >> at xsbt.ExtractAPI.defDef(ExtractAPI.scala:224)
>> >> at xsbt.ExtractAPI.xsbt$ExtractAPI$$definition(ExtractAPI.scala:315)
>> >> at
>> >> xsbt.ExtractAPI$$anonfun$xsbt$ExtractAPI$$processDefinitions$1.apply(
>> >> ExtractAPI.scala:296)
>> >> at
>> >> xsbt.ExtractAPI$$anonfun$xsbt$ExtractAPI$$processDefinitions$1.apply(
>> >> ExtractAPI.scala:296)
>> >> at
>> >> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(
>> >> TraversableLike.scala:251)
>> >> at
>> >> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(
>> >> TraversableLike.scala:251)
>> >> at
>> >> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.
>> >> scala:33)
>> >> at scala.collection.mutable.ArrayOps$ofR

Re: reference to dstream in package org.apache.spark.streaming which is not available

2014-08-22 Thread Tathagata Das
Figured it out. Fixing this ASAP.

TD


On Fri, Aug 22, 2014 at 5:51 PM, Patrick Wendell  wrote:

> Hey All,
>
> We can sort this out ASAP. Many of the Spark committers were at a company
> offsite for the last 72 hours, so sorry that it is broken.
>
> - Patrick
>
>
> On Fri, Aug 22, 2014 at 4:07 PM, Hari Shreedharan <
> hshreedha...@cloudera.com
> > wrote:
>
> > Sean - I think only the ones in 1726 are enough. It is weird that any
> > class that uses the test-jar actually requires the streaming jar to be
> > added explicitly. Shouldn't maven take care of this?
> >
> > I posted some comments on the PR.
> >
> > --
> >
> > Thanks,
> > Hari
> >
> >
> >  Sean Owen 
> >> August 22, 2014 at 3:58 PM
> >>
> >> Yes, master hasn't compiled for me for a few days. It's fixed in:
> >>
> >> https://github.com/apache/spark/pull/1726
> >> https://github.com/apache/spark/pull/2075
> >>
> >> Could a committer sort this out?
> >>
> >> Sean
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: dev-h...@spark.apache.org
> >>
> >> Ted Yu 
> >> August 22, 2014 at 1:55 PM
> >>
> >> Hi,
> >> Using the following command on (refreshed) master branch:
> >> mvn clean package -DskipTests
> >>
> >> I got:
> >>
> >> constituent[36]: file:/homes/hortonzy/apache-maven-3.1.1/conf/logging/
> >> ---
> >> java.lang.reflect.InvocationTargetException
> >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> at
> >> sun.reflect.NativeMethodAccessorImpl.invoke(
> >> NativeMethodAccessorImpl.java:57)
> >> at
> >> sun.reflect.DelegatingMethodAccessorImpl.invoke(
> >> DelegatingMethodAccessorImpl.java:43)
> >> at java.lang.reflect.Method.invoke(Method.java:606)
> >> at
> >> org.codehaus.plexus.classworlds.launcher.Launcher.
> >> launchEnhanced(Launcher.java:289)
> >> at
> >> org.codehaus.plexus.classworlds.launcher.Launcher.
> >> launch(Launcher.java:229)
> >> at
> >> org.codehaus.plexus.classworlds.launcher.Launcher.
> >> mainWithExitCode(Launcher.java:415)
> >> at org.codehaus.plexus.classworlds.launcher.Launcher.
> >> main(Launcher.java:356)
> >> Caused by: scala.reflect.internal.Types$TypeError: bad symbolic
> >> reference.
> >> A signature in TestSuiteBase.class refers to term dstream
> >> in package org.apache.spark.streaming which is not available.
> >> It may be completely missing from the current classpath, or the version
> on
> >> the classpath might be incompatible with the version used when compiling
> >> TestSuiteBase.class.
> >> at
> >> scala.reflect.internal.pickling.UnPickler$Scan.
> >> toTypeError(UnPickler.scala:847)
> >> at
> >> scala.reflect.internal.pickling.UnPickler$Scan$LazyTypeRef.complete(
> >> UnPickler.scala:854)
> >> at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1231)
> >> at
> >> scala.reflect.internal.Types$TypeMap$$anonfun$noChangeToSymbols$1.apply(
> >> Types.scala:4280)
> >> at
> >> scala.reflect.internal.Types$TypeMap$$anonfun$noChangeToSymbols$1.apply(
> >> Types.scala:4280)
> >> at
> >> scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.
> >> scala:70)
> >> at scala.collection.immutable.List.forall(List.scala:84)
> >> at scala.reflect.internal.Types$TypeMap.noChangeToSymbols(
> >> Types.scala:4280)
> >> at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4293)
> >> at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4196)
> >> at scala.reflect.internal.Types$AsSeenFromMap.apply(Types.scala:4638)
> >> at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4202)
> >> at scala.reflect.internal.Types$AsSeenFromMap.apply(Types.scala:4638)
> >> at scala.reflect.internal.Types$Type.asSeenFrom(Types.scala:754)
> >> at scala.reflect.internal.Types$Type.memberInfo(Types.scala:773)
> >> at xsbt.ExtractAPI.defDef(ExtractAPI.scala:224)
> >> at xsbt.ExtractAPI.xsbt$ExtractAPI$$definition(ExtractAPI.scala:315)
> >> at
> >> xsbt.ExtractAPI$$anonfun$xsbt$ExtractAPI$$processDefinitions$1.apply(
> >> ExtractAPI.scala:296)
> >> at
> >> xsbt.ExtractAPI$$anonfun$xsbt$ExtractAPI$$processDefinitions$1.apply(
> >> ExtractAPI.scala:296)
> >> at
> >> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(
> >> TraversableLike.scala:251)
> >> at
> >> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(
> >> TraversableLike.scala:251)
> >> at
> >> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.
> >> scala:33)
> >> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> >> at scala.collection.TraversableLike$class.flatMap(
> >> TraversableLike.scala:251)
> >> at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:108)
> >> at xsbt.ExtractAPI.xsbt$ExtractAPI$$processDefinitions(ExtractAPI.
> >> scala:296)
> >> at xsbt.ExtractAPI$$anonfun$mkStructure$4.apply(ExtractAPI.scala:293)
> >> at xsb

Re: reference to dstream in package org.apache.spark.streaming which is not available

2014-08-22 Thread Patrick Wendell
Hey All,

We can sort this out ASAP. Many of the Spark committers were at a company
offsite for the last 72 hours, so sorry that it is broken.

- Patrick


On Fri, Aug 22, 2014 at 4:07 PM, Hari Shreedharan  wrote:

> Sean - I think only the ones in 1726 are enough. It is weird that any
> class that uses the test-jar actually requires the streaming jar to be
> added explicitly. Shouldn't maven take care of this?
>
> I posted some comments on the PR.
>
> --
>
> Thanks,
> Hari
>
>
>  Sean Owen 
>> August 22, 2014 at 3:58 PM
>>
>> Yes, master hasn't compiled for me for a few days. It's fixed in:
>>
>> https://github.com/apache/spark/pull/1726
>> https://github.com/apache/spark/pull/2075
>>
>> Could a committer sort this out?
>>
>> Sean
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>> Ted Yu 
>> August 22, 2014 at 1:55 PM
>>
>> Hi,
>> Using the following command on (refreshed) master branch:
>> mvn clean package -DskipTests
>>
>> I got:
>>
>> constituent[36]: file:/homes/hortonzy/apache-maven-3.1.1/conf/logging/
>> ---
>> java.lang.reflect.InvocationTargetException
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(
>> NativeMethodAccessorImpl.java:57)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(
>> DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at
>> org.codehaus.plexus.classworlds.launcher.Launcher.
>> launchEnhanced(Launcher.java:289)
>> at
>> org.codehaus.plexus.classworlds.launcher.Launcher.
>> launch(Launcher.java:229)
>> at
>> org.codehaus.plexus.classworlds.launcher.Launcher.
>> mainWithExitCode(Launcher.java:415)
>> at org.codehaus.plexus.classworlds.launcher.Launcher.
>> main(Launcher.java:356)
>> Caused by: scala.reflect.internal.Types$TypeError: bad symbolic
>> reference.
>> A signature in TestSuiteBase.class refers to term dstream
>> in package org.apache.spark.streaming which is not available.
>> It may be completely missing from the current classpath, or the version on
>> the classpath might be incompatible with the version used when compiling
>> TestSuiteBase.class.
>> at
>> scala.reflect.internal.pickling.UnPickler$Scan.
>> toTypeError(UnPickler.scala:847)
>> at
>> scala.reflect.internal.pickling.UnPickler$Scan$LazyTypeRef.complete(
>> UnPickler.scala:854)
>> at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1231)
>> at
>> scala.reflect.internal.Types$TypeMap$$anonfun$noChangeToSymbols$1.apply(
>> Types.scala:4280)
>> at
>> scala.reflect.internal.Types$TypeMap$$anonfun$noChangeToSymbols$1.apply(
>> Types.scala:4280)
>> at
>> scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.
>> scala:70)
>> at scala.collection.immutable.List.forall(List.scala:84)
>> at scala.reflect.internal.Types$TypeMap.noChangeToSymbols(
>> Types.scala:4280)
>> at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4293)
>> at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4196)
>> at scala.reflect.internal.Types$AsSeenFromMap.apply(Types.scala:4638)
>> at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4202)
>> at scala.reflect.internal.Types$AsSeenFromMap.apply(Types.scala:4638)
>> at scala.reflect.internal.Types$Type.asSeenFrom(Types.scala:754)
>> at scala.reflect.internal.Types$Type.memberInfo(Types.scala:773)
>> at xsbt.ExtractAPI.defDef(ExtractAPI.scala:224)
>> at xsbt.ExtractAPI.xsbt$ExtractAPI$$definition(ExtractAPI.scala:315)
>> at
>> xsbt.ExtractAPI$$anonfun$xsbt$ExtractAPI$$processDefinitions$1.apply(
>> ExtractAPI.scala:296)
>> at
>> xsbt.ExtractAPI$$anonfun$xsbt$ExtractAPI$$processDefinitions$1.apply(
>> ExtractAPI.scala:296)
>> at
>> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(
>> TraversableLike.scala:251)
>> at
>> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(
>> TraversableLike.scala:251)
>> at
>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.
>> scala:33)
>> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>> at scala.collection.TraversableLike$class.flatMap(
>> TraversableLike.scala:251)
>> at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:108)
>> at xsbt.ExtractAPI.xsbt$ExtractAPI$$processDefinitions(ExtractAPI.
>> scala:296)
>> at xsbt.ExtractAPI$$anonfun$mkStructure$4.apply(ExtractAPI.scala:293)
>> at xsbt.ExtractAPI$$anonfun$mkStructure$4.apply(ExtractAPI.scala:293)
>> at xsbt.Message$$anon$1.apply(Message.scala:8)
>> at xsbti.SafeLazy$$anonfun$apply$1.apply(SafeLazy.scala:8)
>> at xsbti.SafeLazy$Impl._t$lzycompute(SafeLazy.scala:20)
>> at xsbti.SafeLazy$Impl._t(SafeLazy.scala:18)
>> at xsbti.SafeLazy$Impl.get(SafeLazy.scala:24)
>> at xsbt.ExtractAPI$$anonfun$forceStructures$1.app

Graphx seems to be broken while Creating a large graph(6B nodes in my case)

2014-08-22 Thread npanj
While creating a graph with 6B nodes and 12B edges, I noticed that
*'numVertices' api returns an incorrect result*; 'numEdges' reports the
correct number. A few times (with different datasets > 2.5B nodes) I have
also noticed that numVertices is returned as a negative number, so I suspect
there is some overflow (maybe we are using Int for some field?).

Environment: Standalone mode running on EC2 . Using latest code from master
branch upto commit #db56f2df1b8027171da1b8d2571d1f2ef1e103b6 .

Here are some details of the experiments I have done so far: 
1. Input: numNodes=6101995593 ; noEdges=12163784626
Graph returns: numVertices=1807028297 ; numEdges=12163784626
2. Input : numNodes=*2157586441* ; noEdges=2747322705
Graph Returns: numVertices=*-2137380855* ; numEdges=2747322705
3. Input: numNodes=1725060105 ; noEdges=204176821
Graph: numVertices=1725060105 ; numEdges=2041768213 
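
For what it's worth, the counts reported above are exactly what a 32-bit
truncation of the true vertex count would produce: 6101995593 and 2157586441
wrap to 1807028297 and -2137380855 when forced into an Int. A minimal sketch
(plain Scala, no Spark needed, written only to illustrate the suspected
overflow) reproduces the reported values:

object NumVerticesOverflow {
  def main(args: Array[String]): Unit = {
    // toInt keeps only the low 32 bits, i.e. n - 2^32 for these values.
    Seq(6101995593L, 2157586441L).foreach { n =>
      println(s"$n as Int = ${n.toInt}")
    }
    // prints:
    // 6101995593 as Int = 1807028297
    // 2157586441 as Int = -2137380855
  }
}

That supports the suspicion that an Int is used somewhere along the count path.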


You can find the code to generate this bug here:
https://gist.github.com/npanj/92e949d86d08715bf4bf

(I have also filed this jira ticket:
https://issues.apache.org/jira/browse/SPARK-3190)





--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Graphx-seems-to-be-broken-while-Creating-a-large-graph-6B-nodes-in-my-case-tp7966.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark Contribution

2014-08-22 Thread Reynold Xin
Great idea. Added the link
https://github.com/apache/spark/blob/master/README.md



On Thu, Aug 21, 2014 at 4:06 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> We should add this link to the readme on GitHub btw.
>
> 2014년 8월 21일 목요일, Henry Saputra님이 작성한 메시지:
>
> > The Apache Spark wiki on how to contribute should be a great place to
> > start:
> > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> >
> > - Henry
> >
> > On Thu, Aug 21, 2014 at 3:25 AM, Maisnam Ns  > > wrote:
> > > Hi,
> > >
> > > Can someone help me with some links on how to contribute to Spark?
> > >
> > > Regards
> > > mns
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> > For additional commands, e-mail: dev-h...@spark.apache.org
> 
> >
> >
>


Re: reference to dstream in package org.apache.spark.streaming which is not available

2014-08-22 Thread Hari Shreedharan
Sean - I think only the ones in 1726 are enough. It is weird that any 
class that uses the test-jar actually requires the streaming jar to be 
added explicitly. Shouldn't maven take care of this?


I posted some comments on the PR.

--

Thanks,
Hari



Sean Owen 
August 22, 2014 at 3:58 PM
Yes, master hasn't compiled for me for a few days. It's fixed in:

https://github.com/apache/spark/pull/1726
https://github.com/apache/spark/pull/2075

Could a committer sort this out?

Sean


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org

Ted Yu 
August 22, 2014 at 1:55 PM
Hi,
Using the following command on (refreshed) master branch:
mvn clean package -DskipTests

I got:

constituent[36]: file:/homes/hortonzy/apache-maven-3.1.1/conf/logging/
---
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)

Caused by: scala.reflect.internal.Types$TypeError: bad symbolic reference.
A signature in TestSuiteBase.class refers to term dstream
in package org.apache.spark.streaming which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling
TestSuiteBase.class.
at
scala.reflect.internal.pickling.UnPickler$Scan.toTypeError(UnPickler.scala:847)
at
scala.reflect.internal.pickling.UnPickler$Scan$LazyTypeRef.complete(UnPickler.scala:854)
at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1231)
at
scala.reflect.internal.Types$TypeMap$$anonfun$noChangeToSymbols$1.apply(Types.scala:4280)
at
scala.reflect.internal.Types$TypeMap$$anonfun$noChangeToSymbols$1.apply(Types.scala:4280)
at
scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
at scala.collection.immutable.List.forall(List.scala:84)
at 
scala.reflect.internal.Types$TypeMap.noChangeToSymbols(Types.scala:4280)

at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4293)
at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4196)
at scala.reflect.internal.Types$AsSeenFromMap.apply(Types.scala:4638)
at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4202)
at scala.reflect.internal.Types$AsSeenFromMap.apply(Types.scala:4638)
at scala.reflect.internal.Types$Type.asSeenFrom(Types.scala:754)
at scala.reflect.internal.Types$Type.memberInfo(Types.scala:773)
at xsbt.ExtractAPI.defDef(ExtractAPI.scala:224)
at xsbt.ExtractAPI.xsbt$ExtractAPI$$definition(ExtractAPI.scala:315)
at
xsbt.ExtractAPI$$anonfun$xsbt$ExtractAPI$$processDefinitions$1.apply(ExtractAPI.scala:296)
at
xsbt.ExtractAPI$$anonfun$xsbt$ExtractAPI$$processDefinitions$1.apply(ExtractAPI.scala:296)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)

at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:108)
at 
xsbt.ExtractAPI.xsbt$ExtractAPI$$processDefinitions(ExtractAPI.scala:296)

at xsbt.ExtractAPI$$anonfun$mkStructure$4.apply(ExtractAPI.scala:293)
at xsbt.ExtractAPI$$anonfun$mkStructure$4.apply(ExtractAPI.scala:293)
at xsbt.Message$$anon$1.apply(Message.scala:8)
at xsbti.SafeLazy$$anonfun$apply$1.apply(SafeLazy.scala:8)
at xsbti.SafeLazy$Impl._t$lzycompute(SafeLazy.scala:20)
at xsbti.SafeLazy$Impl._t(SafeLazy.scala:18)
at xsbti.SafeLazy$Impl.get(SafeLazy.scala:24)
at xsbt.ExtractAPI$$anonfun$forceStructures$1.apply(ExtractAPI.scala:138)
at xsbt.ExtractAPI$$anonfun$forceStructures$1.apply(ExtractAPI.scala:138)
at scala.collection.immutable.List.foreach(List.scala:318)
at xsbt.ExtractAPI.forceStructures(ExtractAPI.scala:138)
at xsbt.ExtractAPI.forceStructures(ExtractAPI.scala:139)
at xsbt.API$ApiPhase.processScalaUnit(API.scala:54)
at xsbt.API$ApiPhase.processUnit(API.scala:38)
at xsbt.API$ApiPhase$$anonfun$run$1.apply(API.scala:34)
at xsbt.API$ApiPhase$$anonfun$run$1.apply(API.scala:34)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.Abs

Re: reference to dstream in package org.apache.spark.streaming which is not available

2014-08-22 Thread Sean Owen
Yes, master hasn't compiled for me for a few days. It's fixed in:

https://github.com/apache/spark/pull/1726
https://github.com/apache/spark/pull/2075

Could a committer sort this out?

Sean


On Fri, Aug 22, 2014 at 9:55 PM, Ted Yu  wrote:
> Hi,
> Using the following command on (refreshed) master branch:
> mvn clean package -DskipTests
>
> I got:
>
> constituent[36]: file:/homes/hortonzy/apache-maven-3.1.1/conf/logging/
> ---
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
> at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> Caused by: scala.reflect.internal.Types$TypeError: bad symbolic reference.
> A signature in TestSuiteBase.class refers to term dstream
> in package org.apache.spark.streaming which is not available.
> It may be completely missing from the current classpath, or the version on
> the classpath might be incompatible with the version used when compiling
> TestSuiteBase.class.
> at
> scala.reflect.internal.pickling.UnPickler$Scan.toTypeError(UnPickler.scala:847)
> at
> scala.reflect.internal.pickling.UnPickler$Scan$LazyTypeRef.complete(UnPickler.scala:854)
> at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1231)
> at
> scala.reflect.internal.Types$TypeMap$$anonfun$noChangeToSymbols$1.apply(Types.scala:4280)
> at
> scala.reflect.internal.Types$TypeMap$$anonfun$noChangeToSymbols$1.apply(Types.scala:4280)
> at
> scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
> at scala.collection.immutable.List.forall(List.scala:84)
> at scala.reflect.internal.Types$TypeMap.noChangeToSymbols(Types.scala:4280)
> at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4293)
> at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4196)
> at scala.reflect.internal.Types$AsSeenFromMap.apply(Types.scala:4638)
> at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4202)
> at scala.reflect.internal.Types$AsSeenFromMap.apply(Types.scala:4638)
> at scala.reflect.internal.Types$Type.asSeenFrom(Types.scala:754)
> at scala.reflect.internal.Types$Type.memberInfo(Types.scala:773)
> at xsbt.ExtractAPI.defDef(ExtractAPI.scala:224)
> at xsbt.ExtractAPI.xsbt$ExtractAPI$$definition(ExtractAPI.scala:315)
> at
> xsbt.ExtractAPI$$anonfun$xsbt$ExtractAPI$$processDefinitions$1.apply(ExtractAPI.scala:296)
> at
> xsbt.ExtractAPI$$anonfun$xsbt$ExtractAPI$$processDefinitions$1.apply(ExtractAPI.scala:296)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:108)
> at xsbt.ExtractAPI.xsbt$ExtractAPI$$processDefinitions(ExtractAPI.scala:296)
> at xsbt.ExtractAPI$$anonfun$mkStructure$4.apply(ExtractAPI.scala:293)
> at xsbt.ExtractAPI$$anonfun$mkStructure$4.apply(ExtractAPI.scala:293)
> at xsbt.Message$$anon$1.apply(Message.scala:8)
> at xsbti.SafeLazy$$anonfun$apply$1.apply(SafeLazy.scala:8)
> at xsbti.SafeLazy$Impl._t$lzycompute(SafeLazy.scala:20)
> at xsbti.SafeLazy$Impl._t(SafeLazy.scala:18)
> at xsbti.SafeLazy$Impl.get(SafeLazy.scala:24)
> at xsbt.ExtractAPI$$anonfun$forceStructures$1.apply(ExtractAPI.scala:138)
> at xsbt.ExtractAPI$$anonfun$forceStructures$1.apply(ExtractAPI.scala:138)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at xsbt.ExtractAPI.forceStructures(ExtractAPI.scala:138)
> at xsbt.ExtractAPI.forceStructures(ExtractAPI.scala:139)
> at xsbt.API$ApiPhase.processScalaUnit(API.scala:54)
> at xsbt.API$ApiPhase.processUnit(API.scala:38)
> at xsbt.API$ApiPhase$$anonfun$run$1.apply(API.scala:34)
> at xsbt.API$ApiPhase$$anonfun$run$1.apply(API.scala:34)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at xsbt.API$ApiPhase.run(API.scala:34)
> at scala.tools.nsc.Global$Run.compileUnitsInternal(Global.scala:1583)
> at scala.tools.nsc.Global$Run.compileUnits(Global.scala:1557)
> at scala.tools.nsc.Global$Run.compileSources(Global.scala:1553)
> at scala.tools.nsc.Global$Run.compile(Global.scala:1662)
> at xsbt.CachedCompi

[Spark SQL] off-heap columnar store

2014-08-22 Thread Evan Chan
Hey guys,

What is the plan for getting Tachyon/off-heap support for the columnar
compressed store? It's not in 1.1, is it?

In particular:
 - being able to set TACHYON as the caching mode
 - loading of hot columns or all columns
 - write-through of columnar store data to HDFS or backing store
 - being able to start a context and query directly from Tachyon's
cached columnar data

I think most of this was in Shark 0.9.1.
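
(For reference, a sketch of what seems possible today at the RDD level,
assuming a Spark ~1.1 shell with Tachyon configured via spark.tachyonStore.url;
the Parquet path and table name are made up. Any RDD, including a SchemaRDD,
can be persisted OFF_HEAP, but that stores serialized rows rather than the
compressed columnar format that cacheTable() builds, which is the gap the
questions above are about.)

import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel

val sqlContext = new SQLContext(sc)   // sc: the shell's SparkContext
val t = sqlContext.parquetFile("hdfs:///data/t.parquet")  // made-up path
t.persist(StorageLevel.OFF_HEAP)   // Tachyon-backed, but row-based, not columnar
t.registerTempTable("t")           // registerAsTable on pre-1.1 releases
sqlContext.sql("SELECT * FROM t").take(5)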

Also, how likely is the wire format for the columnar compressed data
to change?  That would be a problem for write-through or persistence.

thanks,
Evan

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



reference to dstream in package org.apache.spark.streaming which is not available

2014-08-22 Thread Ted Yu
Hi,
Using the following command on (refreshed) master branch:
mvn clean package -DskipTests

I got:

constituent[36]: file:/homes/hortonzy/apache-maven-3.1.1/conf/logging/
---
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: scala.reflect.internal.Types$TypeError: bad symbolic reference.
A signature in TestSuiteBase.class refers to term dstream
in package org.apache.spark.streaming which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling
TestSuiteBase.class.
at
scala.reflect.internal.pickling.UnPickler$Scan.toTypeError(UnPickler.scala:847)
at
scala.reflect.internal.pickling.UnPickler$Scan$LazyTypeRef.complete(UnPickler.scala:854)
at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1231)
at
scala.reflect.internal.Types$TypeMap$$anonfun$noChangeToSymbols$1.apply(Types.scala:4280)
at
scala.reflect.internal.Types$TypeMap$$anonfun$noChangeToSymbols$1.apply(Types.scala:4280)
at
scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
at scala.collection.immutable.List.forall(List.scala:84)
at scala.reflect.internal.Types$TypeMap.noChangeToSymbols(Types.scala:4280)
at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4293)
at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4196)
at scala.reflect.internal.Types$AsSeenFromMap.apply(Types.scala:4638)
at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4202)
at scala.reflect.internal.Types$AsSeenFromMap.apply(Types.scala:4638)
at scala.reflect.internal.Types$Type.asSeenFrom(Types.scala:754)
at scala.reflect.internal.Types$Type.memberInfo(Types.scala:773)
at xsbt.ExtractAPI.defDef(ExtractAPI.scala:224)
at xsbt.ExtractAPI.xsbt$ExtractAPI$$definition(ExtractAPI.scala:315)
at
xsbt.ExtractAPI$$anonfun$xsbt$ExtractAPI$$processDefinitions$1.apply(ExtractAPI.scala:296)
at
xsbt.ExtractAPI$$anonfun$xsbt$ExtractAPI$$processDefinitions$1.apply(ExtractAPI.scala:296)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:108)
at xsbt.ExtractAPI.xsbt$ExtractAPI$$processDefinitions(ExtractAPI.scala:296)
at xsbt.ExtractAPI$$anonfun$mkStructure$4.apply(ExtractAPI.scala:293)
at xsbt.ExtractAPI$$anonfun$mkStructure$4.apply(ExtractAPI.scala:293)
at xsbt.Message$$anon$1.apply(Message.scala:8)
at xsbti.SafeLazy$$anonfun$apply$1.apply(SafeLazy.scala:8)
at xsbti.SafeLazy$Impl._t$lzycompute(SafeLazy.scala:20)
at xsbti.SafeLazy$Impl._t(SafeLazy.scala:18)
at xsbti.SafeLazy$Impl.get(SafeLazy.scala:24)
at xsbt.ExtractAPI$$anonfun$forceStructures$1.apply(ExtractAPI.scala:138)
at xsbt.ExtractAPI$$anonfun$forceStructures$1.apply(ExtractAPI.scala:138)
at scala.collection.immutable.List.foreach(List.scala:318)
at xsbt.ExtractAPI.forceStructures(ExtractAPI.scala:138)
at xsbt.ExtractAPI.forceStructures(ExtractAPI.scala:139)
at xsbt.API$ApiPhase.processScalaUnit(API.scala:54)
at xsbt.API$ApiPhase.processUnit(API.scala:38)
at xsbt.API$ApiPhase$$anonfun$run$1.apply(API.scala:34)
at xsbt.API$ApiPhase$$anonfun$run$1.apply(API.scala:34)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at xsbt.API$ApiPhase.run(API.scala:34)
at scala.tools.nsc.Global$Run.compileUnitsInternal(Global.scala:1583)
at scala.tools.nsc.Global$Run.compileUnits(Global.scala:1557)
at scala.tools.nsc.Global$Run.compileSources(Global.scala:1553)
at scala.tools.nsc.Global$Run.compile(Global.scala:1662)
at xsbt.CachedCompiler0.run(CompilerInterface.scala:123)
at xsbt.CachedCompiler0.run(CompilerInterface.scala:99)
at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.in

Graphx GraphLoader Coalesce Shuffle

2014-08-22 Thread Jeffrey Picard
Hey all,

I’ve often found that my Spark programs run much more stably with a higher 
number of partitions, and a lot of the graphs I deal with have a few hundred 
large part files. I was wondering whether a parameter in GraphLoader, defaulting 
to false, that sets the shuffle flag in coalesce is something that might be 
added to GraphX, or whether there was a good reason for not including it. 
I’ve been using this patch myself for a couple of weeks.

—Jeff

diff --git a/graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala b/graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala
index f4c7936..b2f9e9c 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala
@@ -58,13 +58,14 @@ object GraphLoader extends Logging {
   canonicalOrientation: Boolean = false,
   minEdgePartitions: Int = 1,
   edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
-  vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
+  vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
+  shuffle: Boolean = false)
 : Graph[Int, Int] =
   {
 val startTime = System.currentTimeMillis

 // Parse the edge data table directly into edge partitions
-val lines = sc.textFile(path, minEdgePartitions).coalesce(minEdgePartitions)
+val lines = sc.textFile(path, minEdgePartitions).coalesce(minEdgePartitions, shuffle)
 val edges = lines.mapPartitionsWithIndex { (pid, iter) =>
   val builder = new EdgePartitionBuilder[Int, Int]
   iter.foreach { line =>
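
For reference, with the patch applied a call site would look like the sketch
below; the shuffle argument is hypothetical until the change is merged, and
the path and partition count are made up for illustration.

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(
  sc,                        // the shell's SparkContext
  "hdfs:///graphs/edges",    // made-up path
  minEdgePartitions = 400,   // made-up partition count
  shuffle = true)            // proposed parameter from the diff above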




Re: take() reads every partition if the first one is empty

2014-08-22 Thread Andrew Ash
Yep, anyone can create a bug at https://issues.apache.org/jira/browse/SPARK

Then if you make a pull request on GitHub and have the bug number in the
header like "[SPARK-1234] Make take() less OOM-prone", then the PR gets
linked to the Jira ticket.  I think that's the best way to get feedback on
a fix.


On Fri, Aug 22, 2014 at 12:52 PM, pnepywoda  wrote:

> What's the process at this point? Does someone make a bug? Should I make a
> bug? (do I even have permission to?)
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/take-reads-every-partition-if-the-first-one-is-empty-tp7956p7958.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: take() reads every partition if the first one is empty

2014-08-22 Thread pnepywoda
What's the process at this point? Does someone make a bug? Should I make a
bug? (do I even have permission to?)



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/take-reads-every-partition-if-the-first-one-is-empty-tp7956p7958.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: take() reads every partition if the first one is empty

2014-08-22 Thread Andrew Ash
Hi Paul,

I agree that jumping straight from reading N rows from 1 partition to N
rows from ALL partitions is pretty aggressive.  The exponential growth
strategy of doubling the partition count every time seems better -- 1, 2,
4, 8, 16, ... will be much more likely to prevent OOMs than the 1 -> ALL
strategy.

Andrew


On Fri, Aug 22, 2014 at 9:50 AM, pnepywoda  wrote:

> On line 777
>
> https://github.com/apache/spark/commit/42571d30d0d518e69eecf468075e4c5a823a2ae8#diff-1d55e54678eff2076263f2fe36150c17R771
> the logic for take() reads ALL partitions if the first one (or first k) are
> empty. This has actually led to OOMs when we had many partitions
> (thousands) and unfortunately the first one was empty.
>
> Wouldn't a better implementation strategy be
>
> numPartsToTry = partsScanned * 2
>
> instead of
>
> numPartsToTry = totalParts - 1
>
> (this doubling is similar to most memory allocation strategies)
>
> Thanks!
> - Paul
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/take-reads-every-partition-if-the-first-one-is-empty-tp7956.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


take() reads every partition if the first one is empty

2014-08-22 Thread pnepywoda
On line 777
https://github.com/apache/spark/commit/42571d30d0d518e69eecf468075e4c5a823a2ae8#diff-1d55e54678eff2076263f2fe36150c17R771
the logic for take() reads ALL partitions if the first one (or first k) are
empty. This has actually led to OOMs when we had many partitions
(thousands) and unfortunately the first one was empty.

Wouldn't a better implementation strategy be

numPartsToTry = partsScanned * 2

instead of

numPartsToTry = totalParts - 1

(this doubling is similar to most memory allocation strategies)
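
A self-contained sketch of the proposed growth strategy, with a plain
Seq[Seq[T]] standing in for an RDD's partitions (names mirror the variables
above, but this is not the actual RDD.take implementation):

object TakeSketch {
  // Collect up to `num` elements, doubling the number of partitions scanned
  // each round (1, 2, 4, 8, ...) instead of jumping from one to all of them.
  def take[T](partitions: Seq[Seq[T]], num: Int): Seq[T] = {
    val totalParts = partitions.length
    val buf = scala.collection.mutable.ArrayBuffer.empty[T]
    var partsScanned = 0
    while (buf.size < num && partsScanned < totalParts) {
      val numPartsToTry = math.max(partsScanned * 2, 1)
      val until = math.min(partsScanned + numPartsToTry, totalParts)
      // "Run a job" over just this batch of partitions.
      partitions.slice(partsScanned, until).foreach { p =>
        buf ++= p.take(num - buf.size)
      }
      partsScanned = until
    }
    buf.toSeq
  }

  def main(args: Array[String]): Unit = {
    // First partition empty, as in the OOM scenario described above.
    val parts = Seq(Seq.empty[Int], Seq(1, 2), Seq(3, 4), Seq(5, 6))
    println(take(parts, 3).mkString(", "))  // prints: 1, 2, 3
  }
}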

Thanks!
- Paul



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/take-reads-every-partition-if-the-first-one-is-empty-tp7956.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: Spark SQL Query and join different data sources.

2014-08-22 Thread chutium
Oops, thanks Yan, you are right. I got:

scala> sqlContext.sql("select * from a join b").take(10)
java.lang.RuntimeException: Table Not Found: b
at scala.sys.package$.error(package.scala:27)
at
org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:90)
at
org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:90)
at scala.Option.getOrElse(Option.scala:120)
at
org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:90)

And with hql:

scala> hiveContext.hql("select * from a join b").take(10)
warning: there were 1 deprecation warning(s); re-run with -deprecation for
details
14/08/22 14:48:45 INFO parse.ParseDriver: Parsing command: select * from a
join b
14/08/22 14:48:45 INFO parse.ParseDriver: Parse Completed
14/08/22 14:48:45 ERROR metadata.Hive:
NoSuchObjectException(message:default.a table not found)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:27129)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:27097)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:27028)
at
org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:936)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:922)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:854)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
at com.sun.proxy.$Proxy17.getTable(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:924)
at
org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:59)


So sqlContext looks up tables in
org.apache.spark.sql.catalyst.analysis.SimpleCatalog (Catalog.scala), while
hiveContext looks them up in org.apache.spark.sql.hive.HiveMetastoreCatalog
(HiveMetastoreCatalog.scala).

Maybe we can do something in sqlContext to register a Hive table as a
Spark SQL table; we would need to read the column info, partition info,
location, SerDe, Input/OutputFormat, and maybe the StorageHandler as well
from the Hive metastore...
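
In the meantime, one workaround sketch that avoids touching the catalogs at
all: run everything through a single HiveContext, which can resolve both Hive
metastore tables and any SchemaRDD registered as a temporary table. (Assumes
a Spark ~1.1 shell where sc is the SparkContext; the Hive table a, the
Parquet path and the id columns are made up.)

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// "a" is assumed to already exist in the Hive metastore; "b" comes from a
// non-Hive source, here a Parquet file.
val b = hiveContext.parquetFile("hdfs:///data/b.parquet")
b.registerTempTable("b")   // registerAsTable on pre-1.1 releases

// Both names now resolve in the HiveContext catalog, so the join works.
hiveContext.hql("SELECT * FROM a JOIN b ON a.id = b.id").take(10).foreach(println)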




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Query-and-join-different-data-sources-tp7914p7955.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Adding support for a new object store

2014-08-22 Thread Rajendran Appavu

   
I am new to the Spark source code and am looking to see if I can add
push-down support for Spark filters to the storage layer (in my case, an
object store). I am willing to consider how this can be done generically
for any store that we might want to integrate with Spark. I would like to
know which areas I should look into to provide support for a new data store
in this context. Here are some of the questions I have to start with:

1. Do we need to create a new RDD class for the new store that we want to
support? From the SparkContext we create an RDD, and the operations on the
data, including the filter, are performed through the RDD's methods.
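
To make question 1 concrete, here is a minimal sketch of a custom RDD over a
hypothetical object store. Everything named ObjectStoreClient,
ObjectStorePartition, listObjects and read is invented for illustration; the
Spark pieces (extending RDD, getPartitions, compute) are the real extension
points a new store would implement.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Stand-in for a real object-store SDK; both methods are placeholders.
object ObjectStoreClient {
  def listObjects(bucket: String): Seq[String] = Seq("part-0", "part-1")
  def read(bucket: String, key: String): Iterator[String] =
    Iterator(s"$bucket/$key record 1", s"$bucket/$key record 2")
}

// One Partition per object; a real implementation would also split large
// objects and expose preferred locations for locality.
class ObjectStorePartition(val index: Int, val objectKey: String) extends Partition

class ObjectStoreRDD(sc: SparkContext, bucket: String) extends RDD[String](sc, Nil) {

  // Called on the driver when planning the job.
  override def getPartitions: Array[Partition] =
    ObjectStoreClient.listObjects(bucket).zipWithIndex.map {
      case (key, i) => new ObjectStorePartition(i, key): Partition
    }.toArray

  // Called on the executor for each task; returns that partition's records.
  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    val part = split.asInstanceOf[ObjectStorePartition]
    ObjectStoreClient.read(bucket, part.objectKey)
  }
}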

   
2. When we specify the code for the filter task in the RDD.filter() method,
how does it get communicated to the Executor on the data node? Does the
Executor need to compile this code on the fly and execute it, or how does
it work? (I have looked at the code for some time but have not yet figured
this out, so I am looking for pointers that can help me come up to speed on
this part of the code.)

3. How long does the Executor hold the memory, and how does it decide when
to release the memory/cache?

Thank you in advance.

   

   



Regards,
Rajendran.


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org